Import URLs is taking forever
I have never used the "Import URLs" function before. I scraped a couple million URLs for GSA SER sites using other tools and footprints. I am trying a small portion of them (under 20K) just to see how it works. It has been running for two hours and shows "Identified: 484" and "Unknown: 5555". This seems terrible to me. Is this normal speed for this process? If not, how can I figure out what I am doing wrong?
Comments
Whether SER does the scraping or it is done with something else, SER still has to do the "identify" process. Whether it does it all in one batch or little by little, it should not "work much faster" one way or the other. It is going through the same logic either way, therefore taking the same amount of time.
Even if I feed "batches" into Tier 2/3 as you suggest, it will still have to go through the same "identify" process that it is doing now. And as I mentioned in the original post, I am only trying to process 18K URLs now.
I have my settings at 250 threads + HTML timeout 130
and it's taking HOURS to get through my list from SB...
"Dont import lists.. takes too long time and you cant use anything else
in the program while its doing that.. just split it in batches of 50.000
urls and feed them to your Tier2/3 projects, works much faster and the
good ones will sort into your 'submitted' or "identified" list under
Options->Tools automatically if you activated the checkbox there"
top tip!!!
just realized what you said there...
and using this in different projects will allow us to 'import and sort' multi-threaded as well!!
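For anyone who wants to do the 50K split outside of SER, a minimal Python sketch like this works (the file names are just placeholders, not anything SER expects):

# Sketch: split a big scraped URL list into 50,000-line batches
# so they can be fed to Tier 2/3 projects one at a time.
from itertools import islice

BATCH_SIZE = 50_000

with open("scraped_urls.txt", encoding="utf-8") as src:
    batch_num = 1
    while True:
        batch = list(islice(src, BATCH_SIZE))  # read up to 50k lines
        if not batch:
            break
        with open(f"batch_{batch_num:03d}.txt", "w", encoding="utf-8") as out:
            out.writelines(batch)
        batch_num += 1

Each batch_001.txt, batch_002.txt, ... can then be imported into a project on its own.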
Excel (pre-2007) has a limit of 65,536 rows
do you keep duplicate domains (or multiple URLs from the same domain)?
2007 and newer versions can handle a LOT more rows (over a million). Like I said, it takes a while to process them in a formula. You can't move the mouse or anything once it starts processing. Kick off the formula and walk away for 10 minutes.
time to upgrade to a newer version I guess... hahaha
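If the Excel formula route is too slow, the same dedupe-by-domain can be done with a short Python sketch (urls.txt and the output name are assumptions, not fixed names):

# Sketch: keep only the first URL seen for each domain.
from urllib.parse import urlparse

seen = set()
keep = []
with open("urls.txt", encoding="utf-8") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        # urlparse only finds the host if a scheme is present, so add one
        host = urlparse(url if "://" in url else "http://" + url).netloc.lower()
        if host not in seen:
            seen.add(host)
            keep.append(url)
with open("urls_one_per_domain.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(keep) + "\n")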
you mean after getting a harvested list, we import our master list and remove dups? that will end up with one master unique list though
how can we use SB to just sort out the newly found unique URLs vs. the old, already-imported ones?
will try that after my harvesting run is done
confirmed - works like a charm!
master file contains 2 urls:
www.1.com
www.2.com
Test scrape file contains 4 urls:
www.1.com
www.2.com
www.3.com
www.4.com
I then imported them into Scrapebox using the process @ozz mentioned and it didn't work.
I'm looking for a way to identify www.3.com and www.4.com as the unique new URLs and remove the duplicates. Scrapebox isn't doing that for me. Am I missing a step?
the newly scraped list is in the Harvested box
no need to save it (it's already in harvester_sessions)
then:
Import URL List > select the URL lists to compare > browse to your MASTER list
after that the Harvested box only has the newly found URLs
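The same compare can also be done outside Scrapebox with a few lines of Python (a sketch; master.txt, harvested.txt and new_only.txt are stand-ins for your actual files):

# Sketch: keep only harvested URLs that are NOT already in the master list.
with open("master.txt", encoding="utf-8") as f:
    master = {line.strip().lower() for line in f if line.strip()}

with open("harvested.txt", encoding="utf-8") as f:
    new_only = [u for u in (line.strip() for line in f)
                if u and u.lower() not in master]

with open("new_only.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(new_only) + "\n")

With the example above, master.txt holds www.1.com and www.2.com, harvested.txt holds all four URLs, and new_only.txt ends up with just www.3.com and www.4.com.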
Oh... you don't want to do that... okay, I'll spoon-feed you this one time. Here are three other options available to compare files and remove duplicates:
http://textmechanic.com/ - also lots of other cool tools
http://wonderwebware.com/duplicatefinder/
http://www.scrapebox.com/free-dupe-remove