
Completely different processing speed for imported URLs vs identified URLs

Two points and possible bugs:
  1. an apparently extreme difference in URL processing speed
  2. URLs in show URLs > show left target URLs all end with a | (pipe)
I mostly use the global site list for submission,
and SB (ScrapeBox) to create lists to be imported into SER.

Direct import into projects increases LpM;
using the verified list from the global site list reduces LpM.

From a common-sense point of view I would expect the opposite: verified URLs are already tested for matching engines and PR, and thus should be faster in use, or at least never slower ...

1.
Over the last day or more I have made repeated tests showing an apparently extreme DIFFERENCE in target-URL processing speed, depending on where SER takes the URLs from.

I made a special project to test lists:
UN-checked ALL engines except
  • ExpressEngine
  • Drupal Blog
  • MediaWiki
ALL other projects stopped;
all resources free, just for this one project testing list efficiency.

All 3 engines are limited to using targets from the global "verified" list only = all other lists UN-checked.

- NO search engines
- NO other sources of target URLs

Many thousands of verified URLs are available in the verified list for each of the 3 above-mentioned engines.

SER set to 33 threads

SER, however, sometimes runs the above with as few as 0-3 threads; even when the thread count shows a larger number, almost nothing moves in the log scroll.

show URLs > show left target URLs
usually holds some 58 URLs and refills from "verified" correctly.

Threads drop down to 1 or even 0, sometimes over an extended period of time (many minutes), for no apparent reason = no email parsing or anything similar going on.

It appears that every single URL takes the maximum time set as timeout (100 seconds to 120 max);
LpM over an extended period of time (hours) is 0.0x.

Compare that to untested/unverified URL lists imported directly into a project:

when I import a few thousand unverified URLs - pre-filtered - from SB directly into a project,
the processing speed is MANY times faster: LpM with the same resources is 5-12, vs 0.0x for the verified URL list.


2.
URLs in show URLs > show left target URLs all end with a | (pipe).
I checked the left target URLs from ALL projects and tiers,
and most URLs end with a |; the few without a pipe ending might be the few URLs on those lists without any | + PR value at the end??

Is there a technical purpose/reason for such a pipe after each URL in show URLs > show left target URLs,
or
is it a bug, possibly slowing down URL processing?
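
If the pipe is just a field separator for an appended PR value, as speculated above, splitting on it would be trivial; here is a minimal Python sketch of how an importer could read such lines (the url|PR format is my assumption, not confirmed SER behavior):

```python
def parse_target_line(line: str):
    """Split a 'left target URLs' line of the assumed form url|PR.

    Assumption: the trailing pipe separates the URL from an optional
    PR value; lines without a pipe carry the URL only.
    """
    url, sep, pr = line.strip().partition("|")
    if not sep:                          # no pipe at all: URL only
        return url, None
    pr = pr.strip()
    return url, (int(pr) if pr.isdigit() else None)

print(parse_target_line("http://example.com/wiki/index.php|3"))  # PR appended
print(parse_target_line("http://example.com/blog/|"))            # trailing pipe, no PR
print(parse_target_line("http://example.com/forum/"))            # no pipe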


Comments

  • SvenSven www.GSA-Online.de
    I think there is a misunderstanding here. Using the global site list does not take one URL after the other, but a random one from the list. And it does not do that all the time, so as not to stress your CPU. It can of course happen that the same URLs have been processed before, so the LpM goes down because the currently fetched URLs have already been submitted to.
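
    To illustrate that explanation with a quick simulation (a sketch only, not SER's actual code): picking targets at random with replacement soon re-draws URLs that were already handled, and those picks produce no new submissions:

```python
import random

random.seed(1)
pool = [f"http://site{i}.example/" for i in range(10_000)]  # verified list
seen, repeats = set(), 0

for _ in range(5_000):            # 5,000 random picks with replacement
    url = random.choice(pool)
    if url in seen:
        repeats += 1              # already submitted to -> wasted pick
    seen.add(url)

# with these sizes roughly a fifth of all picks are repeats
print(f"{repeats} of 5000 picks were repeats")
```
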
  • hans51
    @sven

    I monitored the log scroll for many hours to understand possible problems or differences compared to the direct import of targets into a project.

    Messages such as:

    "already parsed" = NEVER occur, because there are too many unique domains (several tens of thousands in those above-mentioned tests of 3 engines);

    mostly there are messages such as "new URL" ...

    Other errors in the log scroll occur at about the same rate as with UN-verified URLs imported directly into the project; there, however, maybe a third or so are errors like "no matching engine".
    And yet a direct import is so much faster than using the global verified list.

    We have here a speed difference of LpM 5-12 vs 0.0x = a factor of roughly 100-300 faster than the global verified list ...

    CPU usage during that time is in the low two-digit percent range;
    memory use is maybe 100 MB out of the 2 GB available.

    By the way:
    rather than a random URL
    FROM the verified list,
    it would make more sense to take the oldest first,
    i.e. take URLs from the bottom
    and refill the top of the list,

    like a bank account = to always have ready-to-use URLs, already filtered and allocated to engines.
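
    A minimal sketch of that bank-account idea (my illustration of the suggestion, not SER internals), using a double-ended queue: submissions always pop the oldest URL from the bottom, while a refill step pushes fresh verified URLs onto the top:

```python
from collections import deque

targets = deque()                       # "left target URLs" of one project

def refill(verified_urls):
    """Push fresh verified URLs onto the top of the list."""
    targets.extendleft(verified_urls)

def next_target():
    """Take the oldest URL from the bottom, FIFO style."""
    return targets.pop() if targets else None

refill(["http://a.example/", "http://b.example/"])
refill(["http://c.example/"])
print(next_target())                    # http://a.example/ -> oldest first
```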

    And one more point:
    to make sure I measured pure processing speed with NO "already parsed", I did a
    modify project > delete target URL history.

    Actually, with a total of some 320,000+ unique-domain URLs "in stock" in my verified list for all engines,
    "already parsed" should not even have been possible,
    but this was just to give SER a FREE run.

    If, however, I select ONLY the global "submitted" list, then speed is up to "normal".

    I still believe there is something really wrong with performance,

    because if NO source of URLs is available other than that one list,
    then I would expect SER to make free use of that list as fast as possible to fill all xx threads for submission.

    And the OP's point above:

    2.
    URLs in show URLs > show left target URLs all end with a | (pipe)

    is still open???
  • @hans51 how do you sort your SB list? By OBL, PR, platforms, alive check?
    I tried some of these filters, but altogether it takes a lot of time.

    Wouldn't SER be able to identify the platforms faster than SB? I know you use SER just for posting, but I am curious about the list building. I am just starting with it and would value your advice a lot.

  • @RayBan

    I use Linux = external filtering,
    using a number of shell scripts to sort my SB list using footprints from SER.

    My typical daily harvest on a very slow ISP is about 750k URLs;

    then I dedup the URLs and filter, and get several thousand up to a few tens of thousands of target URLs per daily harvest.

    The filtering of all 750k URLs takes maybe 1 or 2 seconds;
    if I imported them into SER, it might take a day or more!!
    750k URLs, the majority of which are NO target URLs at all, mean a lot of wasted resources and SER then using up to 99% CPU.

    After filtering, I import directly into the SER projects.

    End result:

    best case, about 85% good;
    worst case, about 65% good.

    Typical submit-vs-verify is usually:
    worst, 30% verified on a list;
    best, 60-80% verified.

    It depends widely on the daily lists used to harvest external URLs.
    I am still in the testing phase, as I have been using SB for only a few weeks, and I keep experimenting to make the best of the very poor internet situation here in KH.

    The filtering already includes footprints PLUS a bad-domain list.

    For me that is very efficient compared to SER scraping.
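
    For reference, a rough Python equivalent of such an offline pre-filter (the real scripts are shell; the file names and the exact matching rule are my assumptions): dedup the harvest, drop blacklisted domains, and keep only URLs whose path contains one of the SER footprints:

```python
from urllib.parse import urlparse

# helper files (names are assumptions): one entry per line
footprints = {f.strip().lower() for f in open("footprints.txt") if f.strip()}
bad_domains = {d.strip().lower() for d in open("bad_domains.txt") if d.strip()}

kept, seen = [], set()
for line in open("harvest.txt"):        # the raw SB harvest, ~750k URLs
    url = line.strip()
    if not url or url in seen:          # dedup exact URLs
        continue
    seen.add(url)
    parts = urlparse(url)
    if parts.netloc.lower() in bad_domains:
        continue                        # blacklisted domain
    if any(fp in parts.path.lower() for fp in footprints):
        kept.append(url)                # path matches an inurl footprint

with open("targets.txt", "w") as out:
    out.write("\n".join(kept))
print(f"kept {len(kept)} of {len(seen)} unique URLs")
```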
  • @hans51 - such speed is incredible. If you could create something like that for Win7 users like me, I would be ready to pay.
  • @RayBan

    keep in mind that all filtering is done offline on PATH of harvested URLs
    just use footprints inurl:xyz without the SE code part = i.e. xyz instead of inurl:xyz
    in parsing+filtering / the URLs harvested
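
    Concretely, turning SER search footprints into plain strings for offline matching might look like this (a sketch; the quoting style of the footprints is an assumption):

```python
def strip_footprint(fp: str) -> str:
    """Turn a footprint like inurl:"/wiki/index.php" into the bare
    string /wiki/index.php, usable for offline path matching."""
    fp = fp.strip()
    if fp.lower().startswith("inurl:"):
        fp = fp[len("inurl:"):]
    return fp.strip('"')

print(strip_footprint('inurl:"/wiki/index.php"'))   # /wiki/index.php
print(strip_footprint("inurl:member.php"))          # member.php
```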

    Me, I am a Linux pro = NO Bill and Gates = nothing to do with Win7 all my life, except now strictly for SER use.

    Learn dual boot or use a second machine, get Linux, and LEARN how to do it on Linux;
    or
    learn whatever coding language is needed to do it on Win7.

    I have full-time work that has NOTHING at all to do with coding = no solution for your job,
    but
    search online for a coder;
    search for
    "coder for rent" win

    Maybe there is a free unofficial addon for SB on the market.

    ... if what you do is worth your LIFE, then it is also worth any effort to learn whatever you need to succeed.
    There are things no money can buy
    but that can be provided by God's blessings.
  • @sven

    It is still about the huge difference in URL processing depending on FROM WHERE SER takes its URLs = identified vs directly imported.

    To conclude and deepen my tests, I took ALL identified URLs for the above-mentioned 3 engines
    • ExpressEngine
    • Drupal Blog
    • MediaWiki

    = 33,720 URLs;

    a little cleanup by SB = 32,399 URLs.

    Then I randomized and split them into URL lists of 3,240 URLs each

    and imported these URL lists DIRECTLY into each project and tier (T),

    with ALL other sources of targets switched OFF on ALL projects and tiers.

    Then I ran everything until the imported lists were empty.
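
    The randomize-and-split step, sketched (chunk size from the post; the file names are mine):

```python
import random

urls = [u.strip() for u in open("identified_3engines.txt") if u.strip()]
random.shuffle(urls)                    # randomize so every list mixes engines

CHUNK = 3240                            # 32,399 URLs -> 10 lists
for i in range(0, len(urls), CHUNK):
    with open(f"list_{i // CHUNK + 1}.txt", "w") as out:
        out.write("\n".join(urls[i:i + CHUNK]))
```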

    Two results stand out:

    1. With only ONE project active, the number of threads very soon went down from the original 33 to 1 until the 100-second timeout occurred, then restarted with all threads and reduced threads again as before. CPU usage during this phase ranged from a maximum in the low double digits (%) to as low as 1-2% ...
    2. With ALL 7 projects and tiers active, the overall LpM until all lists were empty (minutes ago) was approx 1.5 = normal for regular NON-SB lists.

    During the second option, the CPU was busy from around the mid double digits (%) with repeated short bursts up to 99%; memory 100 to max 250 MB (out of 2 GB available) = SER working and steadily scrolling the log all the time.

    An LpM difference between the above-mentioned 0.0x (as far as I remember, LpM was around 0.02+) and 1.x means roughly a factor of 50+ for the very same URLs (1.5 / 0.02 ≈ 75).

    Processing of URLs imported directly into projects = faster by approx a factor of 50+ than taking them from the global verified list.

    As a conclusion of the final test above, I still believe there is a bug in the processing procedure for global lists vs directly imported lists:

    the directly imported lists above all needed to get PR again and be matched to an engine, and were still some 50+ times faster??

  • @RayBan

    When looking for such a tool yourself to do offline pre-filtering of SB output,
    keep in mind that this actually may belong in an SB feature request, not an SER one;
    and
    of course I can NOT detect URLs that have an engine installed at the domain level, so all bare domain.TLD/ entries on my SB lists get removed unless they have a URL / folder / subfolder / page matching an SER inurl footprint.

    SER already has such filtering included, but for online use, filtering websites = there SER already does a perfect job.
  • @ron could you share how you filter your list? @hans51 has a great method, but unfortunately it is way too complicated for me, at least for the time being.
  • ronron SERLists.com
    @RayBan I'm not bringing in lists so I'm not doing this kind of filtering.