
Scraping Question

edited February 2013 in Need Help
I have recently bought Scrapebox and have started building my first list. So far I have 137k URLs (de-duped) and growing. These are all based on footprints etc. for GSA.

Do I need to limit the size in any way, or can I just keep building a massive list? Also, do you guys just build one big list, or do you save separate files for all the different footprints that you scrape?

Thanks

Comments

  • Scrapebox is limited to 1 million URLs per list, but everything is stored in Scrapebox's Harvester folder in case you scraped more than 1M before deduping.

    However, split the deduped lists into different parts and import each part directly into your bottom tiers in SER.
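
To make the splitting step concrete, here is a minimal sketch (not a Scrapebox or SER feature; the file name and part size are just example assumptions) that breaks a de-duplicated URL list into smaller text files, one per import:

```python
# Hypothetical helper: split a de-duplicated URL list into smaller files
# so each part can be imported into a bottom tier on its own.
def split_url_list(path, urls_per_part=100_000):
    part, count, out = 0, 0, None
    with open(path, encoding="utf-8", errors="ignore") as src:
        for line in src:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            if out is None or count >= urls_per_part:
                if out:
                    out.close()
                part += 1
                count = 0
                out = open(f"{path}.part{part}.txt", "w", encoding="utf-8")
            out.write(url + "\n")
            count += 1
    if out:
        out.close()

split_url_list("deduped_list.txt")  # example file name
```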
  • 2 questions:

    1) For dedup, I was wondering if we should delete duplicate domains? I have just been deleting duplicate URLs.

    2) Project settings > Data > Tools > Import target URLs... is there a limit to how many URLs can be added?

    I just imported a list of 120k+ (duplicate URLs removed)... but was wondering if there is an import limit of some sort, because Scrapebox can come back with pretty big lists :)

    Thanks for the help!!!
  • SvenSven www.GSA-Online.de

    1) This has no influence on speed at least. But for people who like to have everything sorted and organized to perfection, you should delete duplicate domains for all engines except blogs/image comments.

    2) No, you can add as many as you want. Not all URLs are loaded at once, just 1MB of the file; when that is processed, the next 1MB is loaded.
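
The 1MB-at-a-time behaviour is easy to picture. The sketch below is only an illustration of the idea (it is not SER's actual code, and the file name is made up): read a huge target file in fixed-size chunks and yield one URL at a time instead of loading everything into memory.

```python
# Illustration only: stream URLs from a very large file in ~1MB chunks.
def iter_urls(path, chunk_size=1024 * 1024):
    leftover = ""
    with open(path, encoding="utf-8", errors="ignore") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split("\n")
            leftover = lines.pop()  # the last element may be a partial line
            for line in lines:
                url = line.strip()
                if url:
                    yield url
    if leftover.strip():
        yield leftover.strip()

for url in iter_urls("imported_targets.txt"):  # example file name
    pass  # hand each URL to whatever processes it next
```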

  • Here in the forum, 200k was tested.
  • Hi! Knowing that there are no limits... I imported 2 million URLs for SER to sort through.

    It's been over 12 hours now and I can still see it's going through the URLs.

    @Sven - if I close the program...

    a) does it lose the imported URLs, or
    b) will it continue from where it left off (i.e. continue to process from the last URL), or
    c) start from the first imported URL again?

    Thanks
  • SvenSven www.GSA-Online.de
    b :)
  • One more question @Sven regarding dedup of domains or URLs... re: "This has no influence on speed at least. But for people who like to have everything sorted and organized to perfection, you should delete duplicate domains for all engines except blogs/image comments."

    So for posting, URLs matter only for
    - blog comments
    - image comments

    But I thought they would matter for trackbacks as well? Please confirm :)

    My lists are getting too large if I just dedup URLs :(

  • SvenSven www.GSA-Online.de
    True, trackback, pingback and image comments should be handled the same.
  • edited March 2013
    Great

    Just 4 platforms that need to keep duplicate domains:
    - blog comments
    - image comments
    - trackback
    - pingback

    thanks @Sven
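
Putting the thread's conclusion into one small sketch (the engine names and the function are hypothetical, not a SER setting): de-duplicate by full URL only for the four engines listed above, and by domain for everything else.

```python
from urllib.parse import urlparse

# Engines where the exact page matters, so duplicate domains are kept.
URL_LEVEL_ENGINES = {"blog comment", "image comment", "trackback", "pingback"}

def dedupe(urls, engine):
    """Keep every unique URL for URL-level engines, one URL per domain otherwise."""
    seen, kept = set(), []
    for url in urls:
        key = url if engine in URL_LEVEL_ENGINES else urlparse(url).netloc.lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(url)
    return kept

# Example: for an "article" engine only one URL per domain survives,
# while both blog-comment URLs on the same domain are kept.
print(dedupe(["http://a.com/p1", "http://a.com/p2"], "article"))       # one URL
print(dedupe(["http://a.com/p1", "http://a.com/p2"], "blog comment"))  # both URLs
```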
  • AlexRAlexR Cape Town
    @sven - Doesn't it make sense to have the correct platforms selected by default for dedupe URL and dedupe domain? Otherwise, all users are always deselecting and selecting them one by one. Those who want to edit the defaults can do so, but most would want the remove-duplicate-URLs option to only apply to blog, image, guestbook and trackback platforms.
  • SvenSven www.GSA-Online.de
    The next version preselects them for you.
  • AlexRAlexR Cape Town
    Thanks! That will save time!!!