Save all new target URLs from "use URLs linking on same verified URLs" to ONE separate file for further processing before use for submissions
First, a few facts based on the experience of recent days, to better understand the final feature request.
This feature of harvesting target URLs within SER is certainly one of the best ever: much faster than scraping, even faster than SB.
A few days back I again had some 100k URLs in ONE Tier 2 project; a day later that had increased substantially, and after 2-3 days
I found 540'000+ URLs in "show left target URLs", with lots of gravatar.com duplicate domains.
I saved them all to a separate file for cleanup first.
The next day (yesterday) I found again some 520k URLs in the same tier,
and today some 881'000+ new target URLs.
Result after 3 days = some 1.8+ million new target URLs as a "side-effect", without any scraping!
Below is a detailed description of the situation, what I did, and how a new feature could solve the problem.
I would like to point out that the quality of these harvested new targets is far above the quality of targets scraped with SB and/or SER!
Having an entire target URL file filled with ONLY such raw URLs, however, slows down the overall LpM very substantially.
Why?
I took my first batch of 540k URLs and did some testing.
1. A brief look at the URLs showed some 80'000 URLs all from gravatar.com = hence my assumption that, among other sources, blog comments are used for harvesting.
* I exported ALL 540k URLs to SB
then
* removed gravatar.com = approx 460k URLs left
* removed duplicate domains = approx 140k URLs left
Then I split the remaining ~140k URLs into chunks of 10k (a rough sketch of these cleanup steps follows below).
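For anyone without SB at hand, the same cleanup can be scripted. Here is a minimal Python sketch of the three steps (drop gravatar.com, keep one URL per domain, split into 10k chunks); the file names and chunk size are just examples and nothing SER- or SB-specific.

# clean_targets.py - rough sketch: filter gravatar.com, dedupe by domain, split into 10k chunks
from urllib.parse import urlparse

CHUNK_SIZE = 10_000                 # same chunk size as used above
INPUT_FILE = "left_targets.txt"     # example name for the exported "show left target URLs" file

seen_domains = set()
cleaned = []

with open(INPUT_FILE, encoding="utf-8", errors="ignore") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        domain = urlparse(url).netloc.lower()
        if "gravatar.com" in domain:    # drop the gravatar.com noise
            continue
        if domain in seen_domains:      # remove duplicate domains
            continue
        seen_domains.add(domain)
        cleaned.append(url)

# split into chunks of 10k for step-by-step processing
for i in range(0, len(cleaned), CHUNK_SIZE):
    chunk = cleaned[i:i + CHUNK_SIZE]
    with open(f"targets_part_{i // CHUNK_SIZE + 1:02d}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(chunk) + "\n")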
I then used 2 procedures to process the 14 files I now had:
A -
Direct import of several 10k chunks into SER = import + identify.
Result after several 10k runs, all similar:
= some 60% were deleted by SER
= some 40% were OK, saved to file and then imported into projects
B -
A few other 10k chunks were imported into SB, using the live check addon.
The SB live check is of course much faster and needs far less CPU / SER resources than an import through the SER engine filters.
Result = approx half of all URLs were dead.
The remaining live URLs I imported into SER with engine check.
Running ALL URLs first through a fast live check and only THEN through import + identify seems to be by far the most time / resource efficient way!
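If SB is not available for the live check, a rough equivalent can be scripted as well. The sketch below sends parallel HEAD requests and keeps only URLs that answer with a successful response; it is a crude substitute for the SB addon (some servers reject HEAD requests), and the file names and thread count are assumptions to adjust to your own bandwidth.

# live_check.py - rough sketch of a parallel live check before importing into SER
import concurrent.futures
import urllib.request

INPUT_FILE = "targets_part_01.txt"          # example: one of the 10k chunks
OUTPUT_FILE = "targets_part_01_alive.txt"
THREADS = 50                                # adjust to your bandwidth / CPU

def is_alive(url: str) -> bool:
    """Return True if the URL answers with a successful response."""
    try:
        req = urllib.request.Request(url, method="HEAD",
                                     headers={"User-Agent": "Mozilla/5.0"})
        urllib.request.urlopen(req, timeout=10)
        return True
    except Exception:
        return False

with open(INPUT_FILE, encoding="utf-8", errors="ignore") as f:
    urls = [u.strip() for u in f if u.strip()]

with concurrent.futures.ThreadPoolExecutor(max_workers=THREADS) as pool:
    results = list(pool.map(is_alive, urls))

alive = [u for u, ok in zip(urls, results) if ok]
with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
    out.write("\n".join(alive) + "\n")

print(f"{len(alive)} of {len(urls)} URLs alive")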
There still are a number of errors when actually submitting to that cleaned-up list, such as missing forms, no engine matches, failed registrations, etc.,
but the remaining URLs/sites are absolutely excellent = more domain diversity + well-balanced engine diversification with a focus on high-quality article/blog sites.
I have now been running for some 3 days ONLY on the processed / cleaned-up targets of the above-mentioned harvest = in one day I can harvest all the URLs I need for about 2 days of non-stop submissions, without any scraping! Even without using the large global list = just directly imported NEW targets.
Hence the URLs are of excellent quality = mostly good instant-verify blog, article or wiki sites in a nice balance / almost identical proportions (this may of course vary from user to user).
The problem to solve = the feature request:
- save ALL such harvested URLs from ALL projects and tiers into ONE single file, so that all URLs are in one location for easier re-processing before final use in submissions
- maybe remove gravatar.com in the harvest filters
In addition, in many projects such URLs may appear in smaller quantities, never be recognized as harvested targets, and get "lost" in the masses of other URLs imported by SER.
Users could then pre-process all harvested URLs individually, depending on their available CPU resources and bandwidth.
In my experience the mentioned 2-step process is by far the fastest:
1. Live check everything at the max speed possible (even on my 0.5 to 1 MBps daytime www connection I can run an SB live check while using SER to submit),
while a direct import of uncleaned URLs would require ALL resources and thus a stop of all submissions.
2. Final step = import + identify = save to file.
Then import each file into a project as needed.
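(Tied to the sketches above: that would roughly mean running clean_targets.py once on the exported list, live_check.py on each resulting chunk, and then feeding only the *_alive.txt files into import + identify. The script names are of course just the examples used earlier.)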
The reason for ONE single file to accumulate ALL such harvested URLs is simple:
some or many projects may have much smaller numbers of harvested URLs that visually get lost among the other URLs imported from global lists, hence a prior clean-up would be impossible.
In addition, since the number of such URLs may differ from day to day, you would have to manually open "show left target URLs" for each project, identify the NON-global-list URLs, then copy and export them for processing
every day
= lots of time-consuming work.
Without that clean-up, those with limited resources (CPU or www bandwidth) may suffer from a low LpM = my LpM without prior clean-up is approx 20-25% of my normal LpM.
Another advantage of one centralized file with ALL new target URLs for further clean-up and processing is that you have ALL new cleaned-up target URLs in one place and can split the file into pieces to THEN import the new targets wherever NEW URLs are needed (see the sketch below).
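Until such a feature exists, the manual collection can at least be semi-automated. Here is a minimal sketch that merges several exported "show left target URLs" files into one deduplicated master file; the folder layout and file names are purely illustrative.

# merge_harvested.py - rough sketch: merge per-project exports into ONE master file
import glob

# example: one exported "show left target URLs" file per project / tier
EXPORTS = glob.glob("exports/*_left_targets.txt")
MASTER_FILE = "all_harvested_targets.txt"

seen = set()
merged = []

for path in EXPORTS:
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            url = line.strip()
            if url and url not in seen:     # keep each URL only once
                seen.add(url)
                merged.append(url)

with open(MASTER_FILE, "w", encoding="utf-8") as out:
    out.write("\n".join(merged) + "\n")

print(f"merged {len(EXPORTS)} files into {MASTER_FILE}: {len(merged)} unique URLs")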
The reason why I have all of this in ONE project only:
I figured out that only that project has a larger number of blog comments and guestbooks;
all the others have by far mostly articles, blogs and wikis.
It also seems that MOST or almost all of the new target URLs appear during the RE-verification process (with many dark blue log lines, usually at a CPU load of some 25%).
My RE-verifications are done manually ONCE daily around midnight, when I have max bandwidth (up to 2 MBps).
That might be the reason why I discovered this source of URLs and how to optimize it.
Maybe, just MAYBE, the separate processing of RE-verification, with free CPU %, makes it easier to collect these new URLs??
During normal SER submission with the CPU at 99% there might be no resources free for harvesting and saving these URLs??
Comments
I am sure too many SER users, especially the younger ones struggling for new targets, give too little attention to this new feature and how to optimize it for the greatest benefit.