How would you go about identifying millions of URLs?
sysco32
Skopje
I run 4 scrapebox instances at the same time around 20 hrs,than i feed him the next list to scrape.So basically i do have 4 harvest session folder with millions of urls in it.Let's say one scraping gives me around 10+ million urls.Dedup it than it comes to 200K-1Million URLs.
I just purchased Pi,so i have around 1 week of urls.I don't even want to count how much is that.
How would you go about it to identify them.
In case i run for different project,one for each folder will it give me 4 different set of files,or all of the project will write the same file?/I don't think so,just to be sure./
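Before feeding the lists into PI, the per-session harvest folders can be merged and deduplicated in one pass. Here is a minimal sketch of that step in Python; the folder layout (one `.txt` list per harvest session) and the function name are assumptions for illustration, not part of ScrapeBox or PI:

```python
from pathlib import Path

def dedupe_harvests(harvest_dirs, out_file):
    """Merge URL lists from several harvest session folders
    (hypothetical layout: one or more .txt files per folder)
    and write the unique URLs to a single output file."""
    seen = set()
    for d in harvest_dirs:
        for path in Path(d).glob("*.txt"):
            with open(path, encoding="utf-8", errors="ignore") as f:
                for line in f:
                    url = line.strip()
                    if url:
                        seen.add(url)
    with open(out_file, "w", encoding="utf-8") as out:
        for url in sorted(seen):
            out.write(url + "\n")
    # return the unique count, e.g. 10M raw lines -> ~200K-1M unique
    return len(seen)
```

A plain `set` comfortably handles lists in the tens of millions on a machine with a few GB of RAM; for much larger backlogs you would stream-sort the files instead.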
Comments
And on top of that, I have both projects set to write unidentified URLs to a single folder. Then I have a third project in PI that tries to identify all engines from that folder and writes the identified URLs into the same folder both projects write to.
Everything works together like a beautiful orchestra!