@bingopro, I'm obviously not going to say. If I did, it'd make my source useless as it would get hammered by several people.
Also, a word of advice: if you are going to do long scraping sessions with GScraper, make sure to have it create several files. I'm now sitting with a file of 110 million lines (7.5 GB) and I still can't figure out how to dedupe it. Nothing has managed to split the file so far, and Scrapebox's dup remove addon won't handle it either. I'd appreciate any recommendations.
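For anyone stuck with the same problem: when no GUI tool will open a file that size, a small script that streams it line by line usually will. Here's a minimal sketch in Python of a hash-bucket dedupe, assuming one URL per line; the file names and bucket count are just placeholders to adjust.

```python
# Minimal sketch: dedupe a huge URL list without holding it all in RAM.
# Assumes one URL per line; paths and bucket count are placeholders.
import hashlib
import os

SRC = "scraped_urls.txt"     # the big input file (placeholder name)
OUT = "deduped_urls.txt"
BUCKETS = 64                 # more buckets = less RAM needed per pass

def bucket_of(line):
    # Identical lines always hash to the same bucket,
    # so per-bucket dedupe gives a globally deduped result.
    return int(hashlib.md5(line.encode("utf-8")).hexdigest(), 16) % BUCKETS

# Pass 1: stream the big file into bucket files.
handles = [open(f"bucket_{i}.tmp", "w", encoding="utf-8") for i in range(BUCKETS)]
with open(SRC, encoding="utf-8", errors="ignore") as src:
    for line in src:
        line = line.strip()
        if line:
            handles[bucket_of(line)].write(line + "\n")
for h in handles:
    h.close()

# Pass 2: dedupe each bucket in memory (only ~1/BUCKETS of the data at a time).
with open(OUT, "w", encoding="utf-8") as out:
    for i in range(BUCKETS):
        name = f"bucket_{i}.tmp"
        seen = set()
        with open(name, encoding="utf-8") as bucket:
            for line in bucket:
                if line not in seen:
                    seen.add(line)
                    out.write(line)
        os.remove(name)
```

Because duplicates can never end up in different buckets, deduping each bucket on its own still produces a globally unique list while only a fraction of the data sits in memory at once.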
@fakenickahl GScraper will take forever to open something that size. I have a custom-coded program that runs on Linux and can handle anything I throw at it.
If you want I can try to clean it for you, send me a message... I understand if you don't, so no worries.
@fakenickahl - Try splitting the file, then merge it back together and remove duplicate URLs. I haven't done this with a file quite as big as yours, but it's worked on files about 10% of the size.
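If the splitting itself is what keeps failing, a few lines of streaming Python will chop the file into fixed-size chunks without ever loading it whole. A rough sketch, assuming one URL per line; the chunk size and file names are arbitrary.

```python
# Minimal sketch: split a huge text file into fixed-size chunks by streaming it.
# File names and chunk size are placeholders.
LINES_PER_CHUNK = 5_000_000

def split_file(src_path, prefix="chunk"):
    part, count, out = 0, 0, None
    with open(src_path, encoding="utf-8", errors="ignore") as src:
        for line in src:
            if out is None or count >= LINES_PER_CHUNK:
                if out:
                    out.close()
                part += 1
                count = 0
                out = open(f"{prefix}_{part:03d}.txt", "w", encoding="utf-8")
            out.write(line)
            count += 1
    if out:
        out.close()

split_file("scraped_urls.txt")
```

Keep in mind that splitting by line count alone can leave duplicates sitting in different chunks, so the dedupe still has to happen after merging (or use the hash-bucket approach sketched above).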
Thanks for all your suggestions, guys! I hadn't imagined I'd get much help; I just felt like whining a little.
@Justin, I just tried sertools. When splitting the file it only got through about a quarter of it, and when trying to dedupe it, I let it run for about two hours. It hadn't created any files and the software was unresponsive while using 15 GB of RAM, so I decided to just shut it down. They've got some very nice tools on their homepage though; I'll definitely try them out.
@gooner, I didn't really think of GScraper because I figured it'd be the last tool that could handle such a large list. Thanks though, I might just give it a try.
@jpvr90, that's a great offer. I'll first exhaust a couple more options, but I just might take you up on it. I realize you're also going to use the list, but that only seems fair to me.
@johng, that's what I'd like to do, but I've found no tool that can split such a large file without either crashing or stopping before everything has been split.
I think I've tried six different tools and text editors that were recommended to me, and I've got two tools taking a crack at the list this very moment. I've definitely learnt my lesson, and I'm splitting my files up as I scrape now.
Here's a crazy idea... Create 20 projects in SER, import the list split across all the projects, then export the target URLs from each project individually... You'll end up with a split list. That might just work, or it might kill SER.
@fakenickahl I don't know if this is the case for you, but twice GScraper has "corrupted" some text files on me when scraping/saving and I couldn't open them with anything. The files weren't even that big, and I've cleaned/deduped much larger files than those.
I bought 30 private proxies (http://wooproxy.com/). In Scrapebox's settings I use 4 connections for Google harvesting, and I only scrape 200K links. I tested the proxies and they're alive. If I wait an hour, I can continue scraping links :(. I also bought private proxies from SquidProxies, and I have the same problem.
How do I get around Scrapebox only being able to scrape 1 million links? I have this 500K keyword list plus a footprint list, and right now I'm sitting here grabbing about 1,500 keywords at a time and running them through Scrapebox... very annoying, and it will take forever with 500K. Is there a way to get Scrapebox to run the whole thing so it finds all the links, even if it only shows 1 million, and saves them to files? Or am I screwed and have to do it little by little?
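If batching turns out to be the only way, at least the cutting can be automated. A quick sketch, assuming the keywords sit one per line in a text file; the file names and the 1,500-per-batch size just mirror what's described above.

```python
# Minimal sketch: write a big keyword list out as numbered batch files
# ready to feed into Scrapebox one at a time. Names/sizes are placeholders.
BATCH_SIZE = 1500

with open("keywords.txt", encoding="utf-8", errors="ignore") as src:
    keywords = [k.strip() for k in src if k.strip()]

for n in range(0, len(keywords), BATCH_SIZE):
    with open(f"keywords_batch_{n // BATCH_SIZE + 1:03d}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(keywords[n:n + BATCH_SIZE]) + "\n")
```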
@tsaimllc, @davbel - there is a workaround for your issue. Split your mega list into smaller chunks and buy the SB Automator plugin ($20 or $25). You can set up SB to do everything on its own: scrape, dedupe, save, test proxies. Add many projects at once and that's it.
Comments
Just kidding... GScraper can't de-dup/split it?
Wait a few hours, and then scrape slower.
I haven't used it, but I think GScraper doesn't have this limitation, if that helps.