
Import URLs is taking forever

edited February 2013 in Need Help
I have never used the "Import URLs" function before. I scraped a couple million URLs using other tools and footprints for GSA SER sites, and I am trying a small portion of them (under 20K) just to see how it works. It has been running for two hours and shows "Identified: 484" and "Unknown: 5555". That seems terrible to me. Is this normal speed for this process? If not, how can I figure out what I am doing wrong?

Comments

  • How many threads are you running and at what time out?
  • Yeah...shudda put that...100 threads...30 secs timeout
  • I do have GScraper running as well (with 100 threads), but CPU utilization is only about 20% and network utilization is very low (100kbits/sec or less)
  • sorry....just went back and checked...GScraper is at 50 threads.

  • edited February 2013
    Don't import lists; it takes too long and you can't use anything else in the program while it's doing that. Just split them into batches of 50,000 URLs and feed them to your Tier 2/3 projects. That works much faster, and the good ones will sort into your 'submitted' or "identified" list under Options->Tools automatically if you activated the checkbox there.
  • It would seem to me that one of the main reasons for this option would be to remove some of the "scraping" burden from GSA SER. I should get better results (faster, anyway) if the program does not have to do the scraping itself, leaving more resources for posting.

    Whether SER does the scraping or it is done with something else, SER still has to do the "identify" step. Whether it does that all in one batch or little by little, it should not "work much faster" one way or the other. It goes through the same logic either way, so it should take the same amount of time.

    Even if I feed "batches" into Tier 2/3 as you suggest, they will still have to go through the same "identify" process that it is doing now. And as I mentioned in the original post, I am only trying to process 18K URLs.
  • Yes, it's piss slow. @Sven should work on identifying URLs at a much faster rate. 
  • AlexR (Cape Town)
    @DavidA2 - totally agree! This feature is super slow. What would also be nice is a background feature that rechecks the sitelist entries to ensure they are all still active. 
  • 18K targets only? Put them into a project with 150 threads and it will take 10 minutes or less to go through. Importing them to identify will take 3-4 times longer.
  • glad I found this thread and it's not just a problem with my computer

    have my settings at 250 threads + HTML timeout 130

    and it's taking HOURS to get through my list from Scrapebox...
  • great program!  just this feature is a little slow
  • @squirrelhunter

    "Dont import lists.. takes too long time and you cant use anything else
    in the program while its doing that.. just split it in batches of 50.000
    urls and feed them to your Tier2/3 projects, works much faster and the
    good ones will sort into your 'submitted' or "identified" list under
    Options->Tools automatically if you activated the checkbox there"

    top tip!!!

    just realized what you said there...

    and using this in different projects will allow us to 'import and sort' multi-threaded as well!!
  • I use Excel to manage my scraped lists. I have two tabs, with the master list on one and the newly scraped URLs on the other. Then I do a VLOOKUP to find the new ones. Those new ones go into SER and also over to the master list tab. The processing takes about 15 minutes in Excel to run through the list, but it beats having to wait for SER to do it. (A script equivalent of this comparison is sketched at the end of the thread.)
  • @scp, thanks for the tip

    Excel has a limit of 60K+ rows

    do you keep duplicate domains (i.e. multiple URLs from the same domain)?
  • scp
    edited March 2013
    Don't believe everything M$ tell you ;)

    2007 and newer versions can handle A LOT of rows. Like I said, it takes a while to process them in a formula. You can't move the mouse or anything once it starts processing. Kick off the formula and walk away for 10 minutes.
  • lol, I am using 2007 and I got capped at 60K+ rows

    time to upgrade to a new version I guess... ahhaahha
  • you know that you can use Scrapebox instead of Excel for sorting the new URLs, right?
    Excel sounds a bit complicated to me??! but maybe I'm wrong.
  • @ozz didn't think of that...

    you mean after getting a harvested list, we import our master list and remove dupes? that will end up with one master unique list though

    how can we use SB to just sort out the new unique URLs vs. the old, already imported ones?
  • Ozz
    edited March 2013
    IIRC you have to save your fresh list to a file first.

    then you need to import your master list into Scrapebox and import the fresh list with "select the URL lists to compare". 
    save the list with the new URLs and export it to SER. then you need to merge the new URLs with your master list.

    done.
  • thanks @ozz

    will try that after my harvesting run is done :)
  • "import your master list to scrapebox and import the fresh list with "select the URL lists to compare""

    confirmed - works like a charm!
  • The "select URL lists to compare" doesn't seem to be working. I did a test like this:

    master file contains 2 urls:

    www.1.com
    www.2.com

    Test scrape file contains 4 urls:

    www.1.com
    www.2.com
    www.3.com
    www.4.com

    I then imported into Scrapebox using the process @ozz mentioned and it didn't work.

    I'm looking for a way to tell me that www.3.com and www.4.com are unique and remove the duplicates. Scrapebox isn't doing that for me. Am I missing a step?
  • @scp it works, here is what i did:

    newly scraped list is in the Harvested box

    no need to save it (it's already in the harvester_sessions)

    then

    import url list > select the url lists to compare > browse to your MASTER list

    the Harvested box then only contains the newly found URLs :)
  • edited March 2013
    @scp This will give you what you need: http://jura.wi.mit.edu/bioc/tools/compare.php
  • adding a million lines to an online tool might make it time out? just a thought
  • Come on guys! This is one of those instances where you should be able to figure things out and find solutions for yourself! In case you haven't heard about it, Google has this cool search tool that you can use to find things on the internet.

    Oh... don't want to do that... okay, I'll spoon-feed you this one time. Here are three other options available to compare files and remove duplicates (and there is a small dedupe-by-domain sketch at the end of the thread):

    http://textmechanic.com/ - also lots of other cool tools
    http://wonderwebware.com/duplicatefinder/
    http://www.scrapebox.com/free-dupe-remove



  • sb ftw...
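  • For anyone who wants to do the "compare the fresh scrape against the master list" step outside of Excel or Scrapebox, here is a minimal Python sketch of the workflow described above. It is only an illustration under assumed file names (master_list.txt, fresh_scrape.txt and new_urls_for_ser.txt are hypothetical), not a feature of SER or Scrapebox.

    def load_urls(path):
        """Read one URL per line, ignoring blank lines."""
        with open(path, encoding="utf-8", errors="ignore") as f:
            return {line.strip() for line in f if line.strip()}

    master = load_urls("master_list.txt")    # URLs already known / fed to SER
    scraped = load_urls("fresh_scrape.txt")  # newly harvested URLs

    new_urls = scraped - master              # keep only URLs not seen before

    # hand just the new URLs to SER...
    with open("new_urls_for_ser.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(new_urls)))

    # ...and fold them back into the master list for the next comparison
    with open("master_list.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(master | new_urls)))

    print(f"{len(new_urls)} new URLs out of {len(scraped)} scraped")

    A set difference avoids the per-row lookup that makes the VLOOKUP approach slow, so even a multi-million-line master list should be quick to compare.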
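  • And related to the earlier question about duplicate domains: here is a minimal sketch of "keep only one URL per domain", similar in spirit to the dedupe tools linked above. Again, the input file name is hypothetical.

    from urllib.parse import urlparse

    def dedupe_by_domain(urls):
        """Keep the first URL seen for each host (a leading 'www.' is ignored)."""
        seen, kept = set(), []
        for url in urls:
            host = urlparse(url if "://" in url else "http://" + url).netloc.lower()
            if host.startswith("www."):
                host = host[4:]
            if host and host not in seen:
                seen.add(host)
                kept.append(url)
        return kept

    # one URL per line in, one URL per unique domain out
    with open("new_urls_for_ser.txt", encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in dedupe_by_domain(urls):
        print(url)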