Save all new target URLs from "use URLs linking on same verified URLs" to ONE separate file for further processing before use for submissions
First, a few facts based on the experience of recent days, to better understand the final feature request.
This feature of harvesting target URLs within SER is certainly one of the best ever: much faster than scraping, even faster than SB.
A few days back I again had some 100k URLs in ONE Tier 2 project; a day later that had increased substantially, and after 2-3 days
I found 540'000+ URLs in "show left target URLs", with lots of gravatar.com duplicate domains.
I saved them all to a separate file for cleanup first.
The next day (yesterday) I found again some 520k URLs in the same tier,
and today some 881'000+ new target URLs.
Result after 3 days = some 1.8+ million new target URLs as a "side-effect", without any scraping!
Below is a detailed description of the situation, what I did, and how a new feature could solve the problem.
I would like to point out that the quality of these harvested new targets is far above the quality of targets scraped with SB and/or SER!
Having an entire target URL file filled with ONLY such raw URLs, however, slows down the overall LpM very substantially.
Why?
I took my first batch of 540k URLs and did some testing.
1. A brief look at the URLs showed some 80'000 URLs all from gravatar.com = hence my assumption that, among other sources, blog comments are used for harvesting.
* I exported ALL 540k URLs to SB
then
* removed gravatar.com = approx 460k URLs left
* removed duplicate domains = approx 140k URLs left
Then I split the remaining ~140k URLs into chunks of 10k (a rough sketch of these cleanup steps follows below).
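For anyone without SB at hand, the same cleanup can be scripted. Here is a minimal Python sketch of the three steps (drop gravatar.com, keep one URL per domain, split into 10k chunks); the file names and chunk size are just examples and nothing SER- or SB-specific.

# clean_targets.py - rough sketch: filter gravatar.com, dedupe by domain, split into 10k chunks
from urllib.parse import urlparse

CHUNK_SIZE = 10_000                 # same chunk size as used above
INPUT_FILE = "left_targets.txt"     # example name for the exported "show left target URLs" file

seen_domains = set()
cleaned = []

with open(INPUT_FILE, encoding="utf-8", errors="ignore") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        domain = urlparse(url).netloc.lower()
        if "gravatar.com" in domain:    # drop the gravatar.com noise
            continue
        if domain in seen_domains:      # remove duplicate domains
            continue
        seen_domains.add(domain)
        cleaned.append(url)

# split into chunks of 10k for step-by-step processing
for i in range(0, len(cleaned), CHUNK_SIZE):
    chunk = cleaned[i:i + CHUNK_SIZE]
    with open(f"targets_part_{i // CHUNK_SIZE + 1:02d}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(chunk) + "\n")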
I then used 2 procedures to process the 14 files I now had:
A -
Direct import of several 10k chunks into SER = import + identify.
Result after several 10k runs, all similar:
= some 60% were deleted by SER
= some 40% were OK, saved to file and then imported into projects
B -
A few other 10k chunks were imported into SB, using the live check addon.
The SB live check is of course much faster and needs far less CPU / SER resources than an import through the SER engine filters.
Result = approx half of all URLs were dead.
The remaining live URLs I imported into SER with engine check.
Running ALL URLs first through a fast live check and only THEN through import + identify seems to be by far the most time / resource efficient way!
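If SB is not available for the live check, a rough equivalent can be scripted as well. The sketch below sends parallel HEAD requests and keeps only URLs that answer with a successful response; it is a crude substitute for the SB addon (some servers reject HEAD requests), and the file names and thread count are assumptions to adjust to your own bandwidth.

# live_check.py - rough sketch of a parallel live check before importing into SER
import concurrent.futures
import urllib.request

INPUT_FILE = "targets_part_01.txt"          # example: one of the 10k chunks
OUTPUT_FILE = "targets_part_01_alive.txt"
THREADS = 50                                # adjust to your bandwidth / CPU

def is_alive(url: str) -> bool:
    """Return True if the URL answers with a successful response."""
    try:
        req = urllib.request.Request(url, method="HEAD",
                                     headers={"User-Agent": "Mozilla/5.0"})
        urllib.request.urlopen(req, timeout=10)
        return True
    except Exception:
        return False

with open(INPUT_FILE, encoding="utf-8", errors="ignore") as f:
    urls = [u.strip() for u in f if u.strip()]

with concurrent.futures.ThreadPoolExecutor(max_workers=THREADS) as pool:
    results = list(pool.map(is_alive, urls))

alive = [u for u, ok in zip(urls, results) if ok]
with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
    out.write("\n".join(alive) + "\n")

print(f"{len(alive)} of {len(urls)} URLs alive")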
There still are a number of errors when actually submitting to that cleaned-up list, such as missing forms, no engine matches, failed registrations, etc.,
but the remaining URLs/sites are absolutely excellent = more domain diversity + well-balanced engine diversification with a focus on high-quality article/blog sites.
I have now been running for some 3 days ONLY on the processed / cleaned-up targets of the above-mentioned harvest = in one day I can harvest all the URLs I need for about 2 days of non-stop submissions, without any scraping! Even without using the large global list = just directly imported NEW targets.
Hence the URLs are of excellent quality = mostly good instant-verify blog, article or wiki sites in a nice balance / almost identical proportions (this may of course vary from user to user).
The problem to solve = the feature request:
- save ALL such harvested URLs from ALL projects and tiers into ONE single file, so that all URLs are in one location for easier re-processing before final use in submissions
- maybe remove gravatar.com in the harvest filters
In addition, in many projects such URLs may appear in smaller quantities, never be recognized as harvested targets, and get "lost" in the masses of other URLs imported by SER.
Users could then pre-process all harvested URLs individually, depending on their available CPU resources and bandwidth.
In my experience the mentioned 2-step process is by far the fastest:
1. Live check everything at the max speed possible (even on my 0.5 to 1 MBps daytime www connection I can run an SB live check while using SER to submit),
while a direct import of uncleaned URLs would require ALL resources and thus a stop of all submissions.
2. Final step = import + identify = save to file.
Then import each file into a project as needed.
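(Tied to the sketches above: that would roughly mean running clean_targets.py once on the exported list, live_check.py on each resulting chunk, and then feeding only the *_alive.txt files into import + identify. The script names are of course just the examples used earlier.)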
The reason for ONE single file to accumulate ALL such harvested URLs is simple:
some or many projects may have much smaller numbers of harvested URLs that visually get lost among the other URLs imported from global lists, hence a prior clean-up would be impossible.
In addition, since the number of such URLs may differ from day to day, you would have to manually open "show left target URLs" for each project, identify the NON-global-list URLs, then copy and export them for processing
every day
= lots of time-consuming work.
Without that clean-up, those with limited resources (CPU or www bandwidth) may suffer from a low LpM = my LpM without prior clean-up is approx 20-25% of my normal LpM.
Another advantage of one centralized file with ALL new target URLs for further clean-up and processing is that you have ALL new cleaned-up target URLs in one place and can split the file into pieces to THEN import the new targets wherever NEW URLs are needed (see the sketch below).
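Until such a feature exists, the manual collection can at least be semi-automated. Here is a minimal sketch that merges several exported "show left target URLs" files into one deduplicated master file; the folder layout and file names are purely illustrative.

# merge_harvested.py - rough sketch: merge per-project exports into ONE master file
import glob

# example: one exported "show left target URLs" file per project / tier
EXPORTS = glob.glob("exports/*_left_targets.txt")
MASTER_FILE = "all_harvested_targets.txt"

seen = set()
merged = []

for path in EXPORTS:
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            url = line.strip()
            if url and url not in seen:     # keep each URL only once
                seen.add(url)
                merged.append(url)

with open(MASTER_FILE, "w", encoding="utf-8") as out:
    out.write("\n".join(merged) + "\n")

print(f"merged {len(EXPORTS)} files into {MASTER_FILE}: {len(merged)} unique URLs")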
The reason why I have all of this in ONE project only:
I figured out that only that project has a larger number of blog comments and guestbooks;
all the others have by far mostly articles, blogs and wikis.
It also seems that MOST or almost all of the new target URLs appear during the RE-verification process (with many dark blue log lines, usually at a CPU load of some 25%).
My RE-verifications are done manually ONCE daily around midnight, when I have max bandwidth (up to 2 MBps).
That might be the reason why I discovered this source of URLs and how to optimize it.
Maybe, just MAYBE, the separate processing of RE-verification, with free CPU %, makes it easier to collect these new URLs??
During normal SER submission with the CPU at 99% there might be no resources free for harvesting and saving these URLs??
Comments
I am sure too many SER users, especially the younger ones struggling for new targets, give too little attention to this new feature and how to optimize it for the greatest benefit.