@bingopro, I'm obviously not going to say. If I did, it'd make my source useless as it would get hammered by several people.
Also, a word of advice: if you are going to do long scraping sessions with GScraper, make sure to have it create several files. I'm now sitting with a file of 110 million lines (7.5 GB) and I still can't figure out how to dedupe it. Nothing has managed to split the file so far, and Scrapebox's dup remove addon won't handle it either. I'd appreciate any recommendations.
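For anyone stuck with the same problem: when no GUI tool will open a file that size, a small script that streams it line by line usually will. Here's a minimal sketch in Python of a hash-bucket dedupe, assuming one URL per line; the file names and bucket count are just placeholders to adjust.

```python
# Minimal sketch: dedupe a huge URL list without holding it all in RAM.
# Assumes one URL per line; paths and bucket count are placeholders.
import hashlib
import os

SRC = "scraped_urls.txt"     # the big input file (placeholder name)
OUT = "deduped_urls.txt"
BUCKETS = 64                 # more buckets = less RAM needed per pass

def bucket_of(line):
    # Identical lines always hash to the same bucket,
    # so per-bucket dedupe gives a globally deduped result.
    return int(hashlib.md5(line.encode("utf-8")).hexdigest(), 16) % BUCKETS

# Pass 1: stream the big file into bucket files.
handles = [open(f"bucket_{i}.tmp", "w", encoding="utf-8") for i in range(BUCKETS)]
with open(SRC, encoding="utf-8", errors="ignore") as src:
    for line in src:
        line = line.strip()
        if line:
            handles[bucket_of(line)].write(line + "\n")
for h in handles:
    h.close()

# Pass 2: dedupe each bucket in memory (only ~1/BUCKETS of the data at a time).
with open(OUT, "w", encoding="utf-8") as out:
    for i in range(BUCKETS):
        name = f"bucket_{i}.tmp"
        seen = set()
        with open(name, encoding="utf-8") as bucket:
            for line in bucket:
                if line not in seen:
                    seen.add(line)
                    out.write(line)
        os.remove(name)
```

Because duplicates can never end up in different buckets, deduping each bucket on its own still produces a globally unique list while only a fraction of the data sits in memory at once.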
@fakenickahl GScraper will take forever to open something that size. I have a custom-coded program that runs on Linux and can handle anything I throw at it.
If you want I can try to clean it for you, send me a message... I understand if you don't, so no worries.
@fakenickahl - Try splitting the file, then merge it back together and remove duplicate URLs. I haven't done this with a file quite as big as yours, but it's worked on files about 10% of the size.
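If the splitting itself is what keeps failing, a few lines of streaming Python will chop the file into fixed-size chunks without ever loading it whole. A rough sketch, assuming one URL per line; the chunk size and file names are arbitrary.

```python
# Minimal sketch: split a huge text file into fixed-size chunks by streaming it.
# File names and chunk size are placeholders.
LINES_PER_CHUNK = 5_000_000

def split_file(src_path, prefix="chunk"):
    part, count, out = 0, 0, None
    with open(src_path, encoding="utf-8", errors="ignore") as src:
        for line in src:
            if out is None or count >= LINES_PER_CHUNK:
                if out:
                    out.close()
                part += 1
                count = 0
                out = open(f"{prefix}_{part:03d}.txt", "w", encoding="utf-8")
            out.write(line)
            count += 1
    if out:
        out.close()

split_file("scraped_urls.txt")
```

Keep in mind that splitting by line count alone can leave duplicates sitting in different chunks, so the dedupe still has to happen after merging (or use the hash-bucket approach sketched above).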
Thanks for all your suggestions, guys! I hadn't imagined I'd get much help; I just felt like whining a little.
@Justin, I just tried sertools. When splitting the file it only got through about a quarter of it, and when trying to dedupe it, I let it run for about two hours. It hadn't created any files and the software was unresponsive while using 15 GB of RAM, so I decided to just shut it down. They've got some very nice tools on their homepage though; I'll definitely try them out.
@gooner, I didn't really think of GScraper because I figured it'd be the last tool that could handle such a large list. Thanks though, I might just give it a try.
@jpvr90, that's a great offer. I'll first exhaust a couple more options, but I just might take you up on it. I realize you're also going to use the list, but that only seems fair to me.
@johng, that's what I'd like to do, but I've found no tool that can split such a large file without either crashing or stopping before everything has been split.
I think I've tried six different tools and text editors that were recommended to me, and I've got two tools taking a crack at the list this very moment. I've definitely learnt my lesson, and I'm splitting my files up as I scrape now.
Here's a crazy idea... Create 20 projects in SER, import the list split across all the projects, then export the target URLs from each project individually... You'll end up with a split list. That might just work, or it might kill SER.
@fakenickahl I don't know if this is the case for you, but twice GScraper has "corrupted" some text files on me when scraping/saving and I couldn't open them with anything. The files weren't even that big, and I've cleaned/deduped much larger files than those.
I bought 30 private proxies (http://wooproxy.com/). In Scrapebox's settings I use 4 connections for Google harvesting, and I only scrape 200K links. I tested the proxies and they're alive. If I wait an hour, I can continue scraping links :(. I also bought private proxies from SquidProxies, and I have the same problem.
How do I get around Scrapebox only being able to scrape 1 million links? I have this 500K keyword list plus a footprint list, and right now I'm sitting here grabbing about 1,500 keywords at a time and running them through Scrapebox... very annoying, and it will take forever with 500K. Is there a way to get Scrapebox to run the whole thing so it finds all the links, even if it only shows 1 million, and saves them to files? Or am I screwed and have to do it little by little?
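If batching turns out to be the only way, at least the cutting can be automated. A quick sketch, assuming the keywords sit one per line in a text file; the file names and the 1,500-per-batch size just mirror what's described above.

```python
# Minimal sketch: write a big keyword list out as numbered batch files
# ready to feed into Scrapebox one at a time. Names/sizes are placeholders.
BATCH_SIZE = 1500

with open("keywords.txt", encoding="utf-8", errors="ignore") as src:
    keywords = [k.strip() for k in src if k.strip()]

for n in range(0, len(keywords), BATCH_SIZE):
    with open(f"keywords_batch_{n // BATCH_SIZE + 1:03d}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(keywords[n:n + BATCH_SIZE]) + "\n")
```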
@tsaimllc, @davbel - there is a workaround for your issue. Split your mega list into smaller chunks and buy the SB Automator plugin ($20 or $25). You can set up SB to do everything on its own: scrape, dedupe, save, test proxies. Add many projects at once and that's it.
Comments
Just kidding... GScraper can't de-dup/split it?
Wait a few hours, and then scrape slower.
I haven't used it, but I think GScraper doesn't have this limitation, if that helps.