
Import URLs is taking forever

edited February 2013 in Need Help
I have never used the "Import URLs" function before. I scraped a couple million URLs using other tools and footprints for GSA SER sites, and I am trying a small portion of them (under 20K) just to see how it works. It has been running for two hours and shows "Identified: 484" and "Unknown: 5555". That seems terrible to me. Is this normal speed for this process? If not, how can I figure out what I am doing wrong?

Comments

  • How many threads are you running and at what time out?
  • Yeah...shudda put that...100 threads...30 secs timeout
  • I do have GScraper running as well (with 100 threads), but CPU utilization is only about 20% and network utilization is very low (100kbits/sec or less)
  • sorry....just went back and checked...GScraper is at 50 threads.

  • edited February 2013
    Don't import lists; it takes too long and you can't use anything else in the program while it's doing that. Just split them into batches of 50,000 URLs and feed them to your Tier 2/3 projects. That works much faster, and the good ones will sort into your 'submitted' or "identified" list under Options->Tools automatically if you activated the checkbox there.
  • It would seem to me that one of the main reasons for this option would be to remove some of the "scraping" burden from GSA SER. I should get better results (faster, anyway) if the program does not have to do the scraping itself, leaving more resources for posting.

    Whether SER does the scraping or it is done with something else, SER still has to do the "identify" step. Whether it does that all in one batch or little by little, it should not "work much faster" one way or the other. It goes through the same logic either way, so it should take the same amount of time.

    Even if I feed "batches" into Tier 2/3 as you suggest, they will still have to go through the same "identify" process that it is doing now. And as I mentioned in the original post, I am only trying to process 18K URLs.
  • Yes, it's piss slow. @Sven should work on identifying URLs at a much faster rate. 
  • AlexR (Cape Town)
    @DavidA2 - totally agree! This feature is super slow. What would also be nice is a background feature that rechecks the sitelist entries to ensure they are all still active. 
  • 18K targets only? Put them into a project with 150 threads and it will take 10 minutes or less to go through. Importing them to identify will take 3-4 times longer.
  • glad I found this thread and it's not just a problem with my computer

    have my settings at 250 threads + HTML timeout 130

    and it's taking HOURS to get through my list from Scrapebox...
  • great program!  just this feature is a little slow
  • @squirrelhunter

    "Dont import lists.. takes too long time and you cant use anything else
    in the program while its doing that.. just split it in batches of 50.000
    urls and feed them to your Tier2/3 projects, works much faster and the
    good ones will sort into your 'submitted' or "identified" list under
    Options->Tools automatically if you activated the checkbox there"

    top tip!!!

    just realized what you said there...

    and using this in different projects will allow us to 'import and sort' multi-threaded as well!!
  • I use Excel to manage my scraped lists. I have two tabs, with the master list on one and the newly scraped URLs on the other. Then I do a VLOOKUP to find the new ones. Those new ones go into SER and also over to the master list tab. The processing takes about 15 minutes in Excel to run through the list, but it beats having to wait for SER to do it. (A script equivalent of this comparison is sketched at the end of the thread.)
  • @scp, thanks for the tip

    Excel has a limit of 60K+ rows

    do you keep duplicate domains (i.e. multiple URLs from the same domain)?
  • scp
    edited March 2013
    Don't believe everything M$ tell you ;)

    2007 and newer versions can handle A LOT of rows. Like I said, it takes a while to process them in a formula. You can't move the mouse or anything once it starts processing. Kick off the formula and walk away for 10 minutes.
  • lol, I am using 2007 and I got capped at 60K+ rows

    time to upgrade to a new version I guess... ahhaahha
  • you know that you can use Scrapebox instead of Excel for sorting the new URLs, right?
    Excel sounds a bit complicated to me??! but maybe I'm wrong.
  • @ozz didn't think of that...

    you mean after getting a harvested list, we import our master list and remove dupes? that will end up with one master unique list though

    how can we use SB to just sort out the new unique URLs vs. the old, already imported ones?
  • Ozz
    edited March 2013
    IIRC you have to save your fresh list to a file first.

    then you need to import your master list into Scrapebox and import the fresh list with "select the URL lists to compare". 
    save the list with the new URLs and export it to SER. then you need to merge the new URLs with your master list.

    done.
  • thanks @ozz

    will try that after my harvesting run is done :)
  • "import your master list to scrapebox and import the fresh list with "select the URL lists to compare""

    confirmed - works like a charm!
  • The "select URL lists to compare" doesn't seem to be working. I did a test like this:

    master file contains 2 urls:

    www.1.com
    www.2.com

    Test scrape file contains 4 urls:

    www.1.com
    www.2.com
    www.3.com
    www.4.com

    I then imported into Scrapebox using the process @ozz mentioned and it didn't work.

    I'm looking for a way to tell me that www.3.com and www.4.com are unique and remove the duplicates. Scrapebox isn't doing that for me. Am I missing a step?
  • @scp it works, here is what i did:

    newly scraped list is in the Harvested box

    no need to save it (it's already in the harvester_sessions)

    then

    import url list > select the url lists to compare > browse to your MASTER list

    the Harvested box then only contains the newly found URLs :)
  • edited March 2013
    @scp This will give you what you need: http://jura.wi.mit.edu/bioc/tools/compare.php
  • adding a million lines to an online tool might make it time out? just a thought
  • Come on guys! This is one of those instances where you should be able to figure things out and find solutions for yourself! In case you haven't heard about it, Google has this cool search tool that you can use to find things on the internet.

    Oh... don't want to do that... okay, I'll spoon-feed you this one time. Here are three other options available to compare files and remove duplicates (and there is a small dedupe-by-domain sketch at the end of the thread):

    http://textmechanic.com/ - also lots of other cool tools
    http://wonderwebware.com/duplicatefinder/
    http://www.scrapebox.com/free-dupe-remove



  • sb ftw...
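  • For anyone who wants to do the "compare the fresh scrape against the master list" step outside of Excel or Scrapebox, here is a minimal Python sketch of the workflow described above. It is only an illustration under assumed file names (master_list.txt, fresh_scrape.txt and new_urls_for_ser.txt are hypothetical), not a feature of SER or Scrapebox.

    def load_urls(path):
        """Read one URL per line, ignoring blank lines."""
        with open(path, encoding="utf-8", errors="ignore") as f:
            return {line.strip() for line in f if line.strip()}

    master = load_urls("master_list.txt")    # URLs already known / fed to SER
    scraped = load_urls("fresh_scrape.txt")  # newly harvested URLs

    new_urls = scraped - master              # keep only URLs not seen before

    # hand just the new URLs to SER...
    with open("new_urls_for_ser.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(new_urls)))

    # ...and fold them back into the master list for the next comparison
    with open("master_list.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(master | new_urls)))

    print(f"{len(new_urls)} new URLs out of {len(scraped)} scraped")

    A set difference avoids the per-row lookup that makes the VLOOKUP approach slow, so even a multi-million-line master list should be quick to compare.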
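  • And related to the earlier question about duplicate domains: here is a minimal sketch of "keep only one URL per domain", similar in spirit to the dedupe tools linked above. Again, the input file name is hypothetical.

    from urllib.parse import urlparse

    def dedupe_by_domain(urls):
        """Keep the first URL seen for each host (a leading 'www.' is ignored)."""
        seen, kept = set(), []
        for url in urls:
            host = urlparse(url if "://" in url else "http://" + url).netloc.lower()
            if host.startswith("www."):
                host = host[4:]
            if host and host not in seen:
                seen.add(host)
                kept.append(url)
        return kept

    # one URL per line in, one URL per unique domain out
    with open("new_urls_for_ser.txt", encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in dedupe_by_domain(urls):
        print(url)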