How would you go about identifying millions of URLs?
sysco32
Skopje
I run 4 scrapebox instances at the same time around 20 hrs,than i feed him the next list to scrape.So basically i do have 4 harvest session folder with millions of urls in it.Let's say one scraping gives me around 10+ million urls.Dedup it than it comes to 200K-1Million URLs.
I just purchased Pi,so i have around 1 week of urls.I don't even want to count how much is that.
How would you go about it to identify them.
In case i run for different project,one for each folder will it give me 4 different set of files,or all of the project will write the same file?/I don't think so,just to be sure./
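Before feeding the lists into PI, the per-session harvest folders can be merged and deduplicated in one pass. Here is a minimal sketch of that step in Python; the folder layout (one `.txt` list per harvest session) and the function name are assumptions for illustration, not part of ScrapeBox or PI:

```python
from pathlib import Path

def dedupe_harvests(harvest_dirs, out_file):
    """Merge URL lists from several harvest session folders
    (hypothetical layout: one or more .txt files per folder)
    and write the unique URLs to a single output file."""
    seen = set()
    for d in harvest_dirs:
        for path in Path(d).glob("*.txt"):
            with open(path, encoding="utf-8", errors="ignore") as f:
                for line in f:
                    url = line.strip()
                    if url:
                        seen.add(url)
    with open(out_file, "w", encoding="utf-8") as out:
        for url in sorted(seen):
            out.write(url + "\n")
    # return the unique count, e.g. 10M raw lines -> ~200K-1M unique
    return len(seen)
```

A plain `set` comfortably handles lists in the tens of millions on a machine with a few GB of RAM; for much larger backlogs you would stream-sort the files instead.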
Comments
And on top of that, I have both projects set to write unidentified URLs to a single folder. Then I have a third project in PI that tries to identify all engines from that folder and writes the identified URLs into the same folder both projects write to.
Everything works together like a beautiful orchestra!