
Completely different processing speed for imported URLs vs identified URLs

Two points and possible bugs:
  1. an apparently extreme difference in URL processing speed
  2. URLs in show URLs > show left target URLs all end with a | (pipe)
I mostly use the global site list for submission,
and SB (ScrapeBox) to create lists to be imported into SER.

Direct import into projects increases LpM;
using the verified list from the global site list reduces LpM.

From a common-sense point of view I would expect the opposite: verified URLs are already tested for matching engines and PR, and thus should be faster in use, or at least never slower ...

1.
Over the last day or more I have made repeated tests showing an apparently extreme DIFFERENCE in target-URL processing speed, depending on where SER takes the URLs from.

I made a special project to test lists:
UN-checked ALL engines except
  • ExpressEngine
  • Drupal Blog
  • MediaWiki
ALL other projects stopped;
all resources free, just for this one project testing list efficiency.

All 3 engines are limited to using targets from the global "verified" list only = all other lists UN-checked.

- NO search engines
- NO other sources of target URLs

Many thousands of verified URLs are available in the verified list for each of the 3 above-mentioned engines.

SER set to 33 threads

SER, however, sometimes runs the above with as few as 0-3 threads; even when the thread count shows a larger number, almost nothing moves in the log scroll.

show URLs > show left target URLs
usually holds some 58 URLs and refills from "verified" correctly.

Threads drop down to 1 or even 0, sometimes over an extended period of time (many minutes), for no apparent reason = no email parsing or anything similar going on.

It appears that every single URL takes the maximum time set as timeout (100 seconds to 120 max);
LpM over an extended period of time (hours) is 0.0x.

Compare that to untested/unverified URL lists imported directly into a project:

when I import a few thousand unverified URLs - pre-filtered - from SB directly into a project,
the processing speed is MANY times faster: LpM with the same resources is 5-12, vs 0.0x for the verified URL list.


2.
URLs in show URLs > show left target URLs all end with a | (pipe).
I checked the left target URLs from ALL projects and tiers,
and most URLs end with a |; the few without a pipe ending might be the few URLs on those lists without any | + PR value at the end??

Is there a technical purpose/reason for such a pipe after each URL in show URLs > show left target URLs,
or
is it a bug, possibly slowing down URL processing?
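
If the pipe is just a field separator for an appended PR value, as speculated above, splitting on it would be trivial; here is a minimal Python sketch of how an importer could read such lines (the url|PR format is my assumption, not confirmed SER behavior):

```python
def parse_target_line(line: str):
    """Split a 'left target URLs' line of the assumed form url|PR.

    Assumption: the trailing pipe separates the URL from an optional
    PR value; lines without a pipe carry the URL only.
    """
    url, sep, pr = line.strip().partition("|")
    if not sep:                          # no pipe at all: URL only
        return url, None
    pr = pr.strip()
    return url, (int(pr) if pr.isdigit() else None)

print(parse_target_line("http://example.com/wiki/index.php|3"))  # PR appended
print(parse_target_line("http://example.com/blog/|"))            # trailing pipe, no PR
print(parse_target_line("http://example.com/forum/"))            # no pipe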


Comments

  • SvenSven www.GSA-Online.de
    I think there is a misunderstanding here. Using the global site list does not take one URL after the other, but a random one from the list. And it does not do that all the time, so as not to stress your CPU. It can of course happen that the same URLs have been processed before, so the LpM goes down because the currently fetched URLs have already been submitted to.
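
    To illustrate that explanation with a quick simulation (a sketch only, not SER's actual code): picking targets at random with replacement soon re-draws URLs that were already handled, and those picks produce no new submissions:

```python
import random

random.seed(1)
pool = [f"http://site{i}.example/" for i in range(10_000)]  # verified list
seen, repeats = set(), 0

for _ in range(5_000):            # 5,000 random picks with replacement
    url = random.choice(pool)
    if url in seen:
        repeats += 1              # already submitted to -> wasted pick
    seen.add(url)

# with these sizes roughly a fifth of all picks are repeats
print(f"{repeats} of 5000 picks were repeats")
```
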
  • hans51
    @sven

    I monitored the log scroll for many hours to understand possible problems or differences compared to the direct import of targets into a project.

    Messages such as:

    "already parsed" = NEVER occur, because there are too many unique domains (several tens of thousands in those above-mentioned tests of 3 engines);

    mostly there are messages such as "new URL" ...

    Other errors in the log scroll occur at about the same rate as with UN-verified URLs imported directly into the project; there, however, maybe a third or so are errors like "no matching engine".
    And yet a direct import is so much faster than using the global verified list.

    We have here a speed difference of LpM 5-12 vs 0.0x = a factor of roughly 100-300 faster than the global verified list ...

    CPU usage during that time is in the low two-digit percent range;
    memory use is maybe 100 MB out of the 2 GB available.

    By the way:
    rather than a random URL
    FROM the verified list,
    it would make more sense to take the oldest first,
    i.e. take URLs from the bottom
    and refill the top of the list,

    like a bank account = to always have ready-to-use URLs, already filtered and allocated to engines.
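
    A minimal sketch of that bank-account idea (my illustration of the suggestion, not SER internals), using a double-ended queue: submissions always pop the oldest URL from the bottom, while a refill step pushes fresh verified URLs onto the top:

```python
from collections import deque

targets = deque()                       # "left target URLs" of one project

def refill(verified_urls):
    """Push fresh verified URLs onto the top of the list."""
    targets.extendleft(verified_urls)

def next_target():
    """Take the oldest URL from the bottom, FIFO style."""
    return targets.pop() if targets else None

refill(["http://a.example/", "http://b.example/"])
refill(["http://c.example/"])
print(next_target())                    # http://a.example/ -> oldest first
```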

    And one more point:
    to make sure I measured pure processing speed with NO "already parsed", I did a
    modify project > delete target URL history.

    Actually, with a total of some 320,000+ unique-domain URLs "in stock" in my verified list for all engines,
    "already parsed" should not even have been possible,
    but this was just to give SER a FREE run.

    If, however, I select ONLY the global "submitted" list, then speed is up to "normal".

    I still believe there is something really wrong with performance,

    because if NO source of URLs is available other than that one list,
    then I would expect SER to make free use of that list as fast as possible to fill all xx threads for submission.

    And the OP's point above:

    2.
    URLs in show URLs > show left target URLs all end with a | (pipe)

    is still open???
  • @hans51 how do you sort your SB list? By OBL, PR, platforms, alive check?
    I tried some of these filters, but altogether it takes a lot of time.

    Wouldn't SER be able to identify the platforms faster than SB? I know you use SER just for posting, but I am curious about the list building. I am just starting with it and would value your advice a lot.

  • @RayBan

    I use Linux = external filtering,
    using a number of shell scripts to sort my SB list using footprints from SER.

    My typical daily harvest on a very slow ISP is about 750k URLs;

    then I dedup the URLs and filter, and get several thousand up to a few tens of thousands of target URLs per daily harvest.

    The filtering of all 750k URLs takes maybe 1 or 2 seconds;
    if I imported them into SER, it might take a day or more!!
    750k URLs, the majority of which are NO target URLs at all, mean a lot of wasted resources and SER then using up to 99% CPU.

    After filtering, I import directly into the SER projects.

    End result:

    best case, about 85% good;
    worst case, about 65% good.

    Typical submit-vs-verify is usually:
    worst, 30% verified on a list;
    best, 60-80% verified.

    It depends widely on the daily lists used to harvest external URLs.
    I am still in the testing phase, as I have been using SB for only a few weeks, and I keep experimenting to make the best of the very poor internet situation here in KH.

    The filtering already includes footprints PLUS a bad-domain list.

    For me that is very efficient compared to SER scraping.
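
    For reference, a rough Python equivalent of such an offline pre-filter (the real scripts are shell; the file names and the exact matching rule are my assumptions): dedup the harvest, drop blacklisted domains, and keep only URLs whose path contains one of the SER footprints:

```python
from urllib.parse import urlparse

# helper files (names are assumptions): one entry per line
footprints = {f.strip().lower() for f in open("footprints.txt") if f.strip()}
bad_domains = {d.strip().lower() for d in open("bad_domains.txt") if d.strip()}

kept, seen = [], set()
for line in open("harvest.txt"):        # the raw SB harvest, ~750k URLs
    url = line.strip()
    if not url or url in seen:          # dedup exact URLs
        continue
    seen.add(url)
    parts = urlparse(url)
    if parts.netloc.lower() in bad_domains:
        continue                        # blacklisted domain
    if any(fp in parts.path.lower() for fp in footprints):
        kept.append(url)                # path matches an inurl footprint

with open("targets.txt", "w") as out:
    out.write("\n".join(kept))
print(f"kept {len(kept)} of {len(seen)} unique URLs")
```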
  • @hans51 - such speed is incredible. If you could create something like that for Win7 users like me, I would be ready to pay.
  • @RayBan

    keep in mind that all filtering is done offline on PATH of harvested URLs
    just use footprints inurl:xyz without the SE code part = i.e. xyz instead of inurl:xyz
    in parsing+filtering / the URLs harvested
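
    Concretely, turning SER search footprints into plain strings for offline matching might look like this (a sketch; the quoting style of the footprints is an assumption):

```python
def strip_footprint(fp: str) -> str:
    """Turn a footprint like inurl:"/wiki/index.php" into the bare
    string /wiki/index.php, usable for offline path matching."""
    fp = fp.strip()
    if fp.lower().startswith("inurl:"):
        fp = fp[len("inurl:"):]
    return fp.strip('"')

print(strip_footprint('inurl:"/wiki/index.php"'))   # /wiki/index.php
print(strip_footprint("inurl:member.php"))          # member.php
```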

    Me, I am a Linux pro = NO Bill and Gates = nothing to do with Win7 all my life, except now strictly for SER use.

    Learn dual boot or use a second machine, get Linux, and LEARN how to do it on Linux;
    or
    learn whatever coding language is needed to do it on Win7.

    I have full-time work that has NOTHING at all to do with coding = no solution for your job,
    but
    search online for a coder;
    search for
    "coder for rent" win

    Maybe there is a free unofficial addon for SB on the market.

    ... if what you do is worth your LIFE, then it is also worth any effort to learn whatever you need to succeed.
    There are things no money can buy
    but that can be provided by God's blessings.
  • @sven

    It is still about the huge difference in URL processing depending on FROM WHERE SER takes its URLs = identified vs directly imported.

    To conclude and deepen my tests, I took ALL identified URLs for the above-mentioned 3 engines
    • ExpressEngine
    • Drupal Blog
    • MediaWiki

    = 33,720 URLs;

    a little cleanup by SB = 32,399 URLs.

    Then I randomized and split them into URL lists of 3,240 URLs each

    and imported these URL lists DIRECTLY into each project and tier (T),

    with ALL other sources of targets switched OFF on ALL projects and tiers.

    Then I ran everything until the imported lists were empty.
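
    The randomize-and-split step, sketched (chunk size from the post; the file names are mine):

```python
import random

urls = [u.strip() for u in open("identified_3engines.txt") if u.strip()]
random.shuffle(urls)                    # randomize so every list mixes engines

CHUNK = 3240                            # 32,399 URLs -> 10 lists
for i in range(0, len(urls), CHUNK):
    with open(f"list_{i // CHUNK + 1}.txt", "w") as out:
        out.write("\n".join(urls[i:i + CHUNK]))
```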

    Two results stand out:

    1. With only ONE project active, the number of threads very soon went down from the original 33 to 1 until the 100-second timeout occurred, then restarted with all threads and reduced threads again as before. CPU usage during this phase ranged from a maximum in the low double digits (%) to as low as 1-2% ...
    2. With ALL 7 projects and tiers active, the overall LpM until all lists were empty (minutes ago) was approx 1.5 = normal for regular NON-SB lists.

    During the second option, the CPU was busy from around the mid double digits (%) with repeated short bursts up to 99%; memory 100 to max 250 MB (out of 2 GB available) = SER working and steadily scrolling the log all the time.

    An LpM difference between the above-mentioned 0.0x (as far as I remember, LpM was around 0.02+) and 1.x means roughly a factor of 50+ for the very same URLs (1.5 / 0.02 ≈ 75).

    Processing of URLs imported directly into projects = faster by approx a factor of 50+ than taking them from the global verified list.

    As a conclusion of the final test above, I still believe there is a bug in the processing procedure for global lists vs directly imported lists:

    the directly imported lists above all needed to get PR again and be matched to an engine, and were still some 50+ times faster??

  • @RayBan

    When looking for such a tool yourself to do offline pre-filtering of SB output,
    keep in mind that this actually may belong in an SB feature request, not an SER one;
    and
    of course I can NOT detect URLs that have an engine installed at the domain level, so all bare domain.TLD/ entries on my SB lists get removed unless they have a URL / folder / subfolder / page matching an SER inurl footprint.

    SER already has such filtering included, but for online use, filtering websites = there SER already does a perfect job.
  • @ron could you share how you filter your list? @hans51 has a great method, but unfortunately it is way too complicated for me, at least for the time being.
  • ronron SERLists.com
    @RayBan I'm not bringing in lists so I'm not doing this kind of filtering.