Completely different processing speed for imported URLs vs identified URLs
2 points and possible bugs:
- extreme apparent difference in URL processing speed
- URLs in show URLs > show left target URLs all end with a | (pipe)
I use SB to create lists that are then imported into SER:
direct import into projects increases LpM
using the verified list from the global list reduces LpM
from a common-sense point of view I would expect the opposite: verified URLs are already tested for matching engines and PR = thus they should be faster, or at least never slower, in use ...
1.
over the last 1+ days I made repeated tests showing an apparently extreme DIFFERENCE in the processing speed of target URLs, depending on where SER takes them from.
I made a special project to test lists:
UN-checked ALL engines except
- ExpressEngine
- Drupal Blog
- MediaWiki
all resources free, just for this one project testing list efficiency
all 3 engines are limited to using targets from the global list "verified" only = all other lists UN-checked
- NO SE
- NO other sources of target URLs
many thousands of verified URLs are available in the "verified" list for each of the 3 above-mentioned engines
SER set to 33 threads
SER however sometimes runs the above with as few as 0-3 threads - even when the thread count shows a larger number, almost nothing moves in the log scroll
show URLs > show left target URLs
usually shows some 58 URLs and refills from "verified" correctly
threads drop down to 1 or even 0, sometimes over extended periods of time (many minutes), for no apparent reason = no email parsing or anything similar
it appears that every single URL takes the maximum time set as timeout (100 up to 120 seconds)
LpM over extended periods of time (hours) is 0.0x
compared to:
untested / unverified URL lists imported directly into the project
when I import a few thousand unverified - pre-filtered - URLs from SB directly into a project,
the processing speed is MANY times faster - with the same resources, LpM is 5-12 vs 0.0x for the verified URL list
2.
URLs in show URLs > show left target URLs all end with a | (pipe)
I checked all left target URLs from ALL projects and tiers
and most URLs end with a | - the few without a pipe ending might be the few URLs on those lists without any | + PR value at the end ??
is there a technical purpose / reason for such a pipe after each URL in show URLs > show left target URLs
or
is it a bug that possibly slows down URL processing ?
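for anyone who wants to inspect this offline: assuming the pipe is just a field separator in a url|PR layout (my guess only - I do not know SER's actual internal format), the trailing field is easy to strip:

```shell
# strip a trailing "|PR" field from each line, keeping the bare URL
# (the url|PR layout is an assumption, not SER's documented format)
printf '%s\n' 'http://example.com/page|3' 'http://example.org/wiki|' \
              'http://example.net/no-pipe' > left_targets.txt
# cut keeps everything before the first "|"; lines without one pass through unchanged
cut -d'|' -f1 left_targets.txt > bare_urls.txt
```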
Comments
I monitored the log scroll for many hours to understand possible problems or differences compared to the direct import of targets into the project
messages such as:
"already parsed" = NEVER occur, because there are too many unique domains (several tens of thousands in those above-mentioned tests of 3 engines)
but mostly there are messages such as "NEW url ..."
other errors occur in the log scroll about as often as with directly imported UN-verified URLs = there, however, maybe a third of the errors are "no matching engine"
and yet a direct import is so much faster than using the global list "verified"
we have here a difference in speed of LpM 5-12 vs 0.0x = a factor of 100-300 faster than the global list "verified" ...
CPU usage during that time is in the lower two-digit %
memory use is maybe 100 MB out of the 2 GB available
btw
rather than taking random URLs FROM the verified list,
it would make more sense to take the oldest first,
i.e. take URLs from the bottom of the list
and refill at the top of the list,
like a bank account = to always have ready-to-use URLs, already filtered and allocated to engines
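the "bank account" idea above, as a minimal shell sketch (file names and batch size are made up - SER would of course do this internally):

```shell
# consume the OLDEST targets from the bottom of the list,
# so new URLs can be refilled at the top (FIFO, like a bank account)
printf '%s\n' url5 url4 url3 url2 url1 > verified_list.txt  # url1 = oldest, at the bottom
BATCH=2
tail -n "$BATCH" verified_list.txt > batch_to_process.txt   # take the oldest entries
head -n -"$BATCH" verified_list.txt > tmp && mv tmp verified_list.txt  # GNU head: drop last N lines
```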
and one more point
to make sure I had NO "already parsed" but pure processing speed,
I did a
modify project > delete target URL history
ACTUALLY, with a total of some 320'000+ unique domain URLs "on stock" in my verified list for all engines,
"already parsed" should not even have been possible,
but it was just to give SER a FREE run
if however I select ONLY global list "submitted", then speed is up to "normal"
I still believe there is something really wrong in performance,
because if NO other source of URLs is available but that one list,
then I would expect SER to make free use of that list as fast as possible, to fill all xx threads for submission
the above OP point
2.
URLs in show URLs > show left target URLs all end with a | (pipe)
is still open ???
I tried some of these filters, but altogether it takes a lot of time.
Wouldn't SER be able to identify the platform faster than SB ? I know you use SER just for posting, but I am curious about the list building. I am just starting with it and would value your advice a lot.
I use Linux = external filtering
using a number of shell scripts to sort my SB list using footprints from SER
my typical daily harvest on a very slow ISP is about 750k URLs,
then I dedup the URLs and filter, and get from several thousand up to a few tens of thousands of target URLs per daily harvest
the filtering of all 750k URLs takes maybe 1 or 2 seconds
if I imported them into SER = maybe a day or more !!
750k URLs with the majority being NO target URL at all = a lot of wasted resources, and SER then using up to 99% CPU
after filtering = import into SER projects directly
end-result:
best case about 85% good
worst case about 65% good
typical submitted vs verified is usually
worst: 30% verified on a list
best: 60-80% verified
it depends widely on the daily lists used to harvest external URLs
I am still in the testing phase, as I have used SB for only a few weeks, and I keep experimenting to make the best of the very poor www situation here in KH
the filtering already includes footprints PLUS a bad-domain list
for me that is very efficient compared to SER scraping
keep in mind that all filtering is done offline on the PATH of the harvested URLs
just use the footprints without the SE operator part = i.e. xyz instead of inurl:xyz
when parsing + filtering the harvested URLs
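roughly that idea, as a simplified sketch (file names are examples, and unlike the real scripts this one skips the bad-domain list and matches the fragment anywhere in the URL, not only in the path):

```shell
# 1) strip the inurl: operator from SER footprints -> plain fragments
# 2) dedup the harvest and keep only URLs containing a fragment
printf '%s\n' 'inurl:wiki/index.php' 'inurl:node/add' > footprints.txt
printf '%s\n' 'http://a.com/wiki/index.php?title=X' \
              'http://b.com/blog/post' \
              'http://a.com/wiki/index.php?title=X' \
              'http://c.com/node/add/page' > harvest.txt
sed 's/^inurl://' footprints.txt > fragments.txt
# fixed-string matching (-F) against every fragment (-f) is what makes
# filtering hundreds of thousands of URLs take only seconds
sort -u harvest.txt | grep -F -f fragments.txt > targets.txt
```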
me, I am a Linux pro = NO Bills and Gates = nothing to do with Win7 in all my life, except now, strictly for SER use
learn dual boot or get a second machine, install Linux and LEARN how to do it on Linux
or
learn whatever coding language is needed to do it on Win7
I have full-time work that has NOTHING at all to do with coding = no solution for your job
but
you could search online for a coder,
search for
"coder for rent" win
maybe there is a free unofficial addon for SB on the market
... if what you do is worth your LIFE = then it is also worth any effort to learn whatever you need to succeed
there are things no money can buy
but they can be provided by God's blessings
it is still about the huge difference in URL processing depending on WHERE SER takes its URLs FROM = identified vs directly imported
to conclude and deepen my tests, I took ALL identified URLs from the above-mentioned 3 engines
= 33720 URLs
a little clean-up by SB = 32399 URLs
then I randomized and split them into URL lists of 3240 URLs each
and imported these URL lists DIRECTLY into each project and tier
and on ALL projects and tiers = ALL other sources of targets switched OFF
then I ran everything until the imported lists were empty
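the randomize-and-split step above could be done with GNU shuf and split (a sketch with made-up file names; 3 URLs per chunk just for the demo, instead of 3240):

```shell
# randomize a cleaned URL list and split it into equal chunks
# for direct import into the projects
printf '%s\n' u1 u2 u3 u4 u5 u6 > cleaned.txt
shuf cleaned.txt | split -l 3 - chunk_   # -> chunk_aa, chunk_ab (3 URLs each)
```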
2 results stand out
during the second option, the CPU was busy from around mid xx % with repeated short bursts up to 99% / memory 100 up to max 250 MB (of the 2 GB available) = SER working and steadily scrolling the log all the time
the LpM difference between the above-mentioned 0.0x (as far as I remember, LpM was around 0.02+) and 1.x means approx 50+ times faster for the very same URLs
processing of URLs imported directly into projects = faster by approx a factor of 50+ than importing from the global list "verified"
as a conclusion of the above final test, I still believe there is a bug in the processing procedure of global lists vs directly imported lists.
the directly imported lists above all needed to get PR again and be matched to an engine, and were still some 50+ times faster ??
when looking for such a tool yourself, to do offline pre-filtering of SB output,
keep in mind that this actually may belong in an SB feature request and not in SER
and
of course I can NOT detect URLs that have an engine installed at the domain level, and thus all domain.TLD/ entries on my SB lists get removed unless they have a URL / folder / subfolder / page matching an SER inurl footprint
SER already has such filtering included - but for online use, filtering web sites, SER already does a perfect job