
How are these churn and burn results possible?


Comments

  • @bingopro, I'm obviously not going to say. If I did it'd make my source useless as it would get hammered by several people.

    Also, a word of advice. If you are going to do long scraping sessions with GScraper, make sure to have it create several files. I'm now sitting with a file of 110 million lines (7.5 GB) and I still can't figure out how to dedupe it. Nothing has managed to split the file so far, and Scrapebox's DupRemove addon won't handle it either. I'd appreciate any recommendations (one way of approaching it is sketched right after this comment).
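Since the split-then-dedupe idea comes up several times below, here is a minimal sketch of one way a file that size can be deduped without loading it all into RAM: hash each line into one of many bucket files, then dedupe each bucket in memory. This is not any of the tools mentioned in the thread; the file names and bucket count are placeholder values.

```python
import hashlib
import os

# Minimal sketch: dedupe a huge line-based file (e.g. a scraped URL list)
# that won't fit in RAM. Pass 1 scatters lines into bucket files by hash,
# so identical lines always land in the same bucket; pass 2 dedupes each
# bucket with an in-memory set. With 256 buckets, a 7.5 GB file works out
# to very roughly 30 MB per bucket.
# "urls.txt", "deduped.txt" and NUM_BUCKETS are placeholder values.

NUM_BUCKETS = 256


def dedupe_large_file(src="urls.txt", dst="deduped.txt", workdir="buckets"):
    os.makedirs(workdir, exist_ok=True)

    def bucket_path(i):
        return os.path.join(workdir, f"bucket_{i}.txt")

    buckets = [open(bucket_path(i), "w", encoding="utf-8", errors="replace")
               for i in range(NUM_BUCKETS)]

    # Pass 1: scatter every line into a bucket chosen by its hash.
    with open(src, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            i = int(hashlib.md5(line.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS
            buckets[i].write(line + "\n")
    for b in buckets:
        b.close()

    # Pass 2: dedupe each bucket in memory and append it to the output.
    with open(dst, "w", encoding="utf-8") as out:
        for i in range(NUM_BUCKETS):
            seen = set()
            with open(bucket_path(i), "r", encoding="utf-8", errors="replace") as f:
                for line in f:
                    if line not in seen:
                        seen.add(line)
                        out.write(line)
            os.remove(bucket_path(i))


if __name__ == "__main__":
    dedupe_large_file()
```

Because duplicates are guaranteed to land in the same bucket, no final merge-and-dedupe pass is needed; only one bucket's worth of unique lines ever sits in memory at a time.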
  • Did you try sertools? @fakenickahl
  • goonergooner SERLists.com
    edited April 2014
    @fakenickahl - I know, but I'm not going to say! :D
    Just kidding... GScraper can't de-dup/split it?
  • edited April 2014
    @fakenickahl GScraper will take forever to open something that size. I have a custom-coded program that runs on Linux and can handle anything I throw at it.

    If you want, I can try to clean it for you, send me a message... I understand if you don't, so no worries :)
  • @fakenickahl - Try splitting the file, merging it back together, and then removing the duplicate URLs. I haven't done this with a file quite as big as yours, but it's worked on files about 10% of the size (a quick script for the splitting step is sketched below).
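For the splitting step in particular, here is a minimal sketch that cuts a big text file into fixed-size chunks small enough for other tools to open. LINES_PER_CHUNK and the file names are placeholders; adjust them to whatever your tools can handle.

```python
# Minimal sketch: split a huge line-based file into fixed-size chunks so
# that tools which choke on the full file can open the pieces one by one.
# "urls.txt", the chunk prefix and LINES_PER_CHUNK are placeholder values.

LINES_PER_CHUNK = 5_000_000


def split_file(src="urls.txt", prefix="chunk"):
    part = 0
    out = None
    with open(src, "r", encoding="utf-8", errors="replace") as f:
        for n, line in enumerate(f):
            if n % LINES_PER_CHUNK == 0:
                if out:
                    out.close()
                part += 1
                out = open(f"{prefix}_{part:03d}.txt", "w", encoding="utf-8")
            out.write(line)
    if out:
        out.close()


if __name__ == "__main__":
    split_file()
```

One caveat with split-then-dedupe: duplicates that end up in different chunks survive per-chunk deduping, so a final pass over the merged result is still needed, which is what the hash-bucket sketch earlier in the thread avoids.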
  • Thanks for all your suggestions, guys! I hadn't imagined I'd get much help; I just felt like whining a little.

    @Justin, I just tried sertools. When splitting the file it only got through about a quarter of it, and when deduping I let it run for about two hours. It hadn't created any files and the software was unresponsive while using 15 GB of RAM, so I decided to just shut it down. They've got some very nice tools on their homepage though, I'll definitely try them out.

    @gooner, I didn't really consider GScraper because I figured it'd be the last tool that could handle such a large list. Thanks though, I might just give it a try :)

    @jpvr90, that's a great offer. I'll first exhaust a couple more options, but I just might take you up on it. I realize you're also going to use the list, but that only seems fair to me.

    @johng, that's exactly what I want to do, but I've found no tool that will split such a large file without either crashing or stopping before everything has been split.

    I think I've tried 6 different tools and text editors that were recommended to me, and I've got 2 tools taking a crack at the list this very moment. I've definitely learnt my lesson and I'm splitting my files up as I scrape now :)
  • goonergooner SERLists.com
    edited April 2014
    Here's a crazy idea... Create 20 projects in SER, import the list and split it across all projects, then export the target URLs from each project individually... You'll end up with a split list. That might just work or it might kill SER :D
  • @fakenickahl I don't know if this is the case for you, but twice GScraper has "corrupted" some text files somehow while scraping/saving, and I could not open them with anything. The files weren't even that big, and I have cleaned/deduped much larger files than those.
  • fakenickahl, I use the Scrapebox dedupe tool to split the file, and then you can dedupe it.

    Alternatively, try this http://www.gdgsoft.com/gsplit/
  • @fakenickahl, try Once is Enough. I'm sure it can handle silly-sized files.
  • I bought 30 private proxies (http://wooproxy.com/). In Scrapebox's settings I set Google harvesting to 4, and I only scrape 200K links. I tested the proxies - they're alive. If I wait one hour I can continue scraping links (((. I also bought private proxies from SquidProxies and I have the same problem.
    What am I doing wrong? Help me, please.
  • molchomolcho Germany
    edited April 2014
    Your proxies are blocked by the search engine.

    Wait a few hours, and then scrape more slowly (a rough sketch of what that looks like follows below).
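In practice, "scrape slower" mostly means putting a delay between requests and spreading them across the proxy pool instead of firing everything through one proxy at once. A minimal sketch of that idea, assuming the Python requests library; the proxy entries, delay and URLs are placeholders, not values from the thread.

```python
import itertools
import time

import requests

# Minimal sketch: throttle requests and rotate through a pool of private
# proxies so no single proxy hammers the search engine.
# The proxy addresses and the delay below are placeholder values.

PROXIES = [
    "http://user:pass@1.2.3.4:8080",
    "http://user:pass@5.6.7.8:8080",
]
DELAY_SECONDS = 5.0  # raise this if the proxies keep getting blocked


def fetch_all(urls):
    """Fetch each URL through the next proxy in the pool, pausing between requests."""
    proxy_cycle = itertools.cycle(PROXIES)
    for url in urls:
        proxy = next(proxy_cycle)
        try:
            resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
            yield url, resp.status_code, resp.text
        except requests.RequestException as exc:
            yield url, None, str(exc)
        time.sleep(DELAY_SECONDS)  # this pause is the "scrape slower" part
```

Each proxy only sees every Nth request and the pool as a whole runs at a fixed pace, which is roughly the same effect as lowering the connection count in Scrapebox or GScraper.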
  • How do I get around Scrapebox only being able to scrape 1 million links? I have this 500K keyword + footprint list, and I'm currently sitting here grabbing about 1,500 keywords at a time and running them through Scrapebox... very annoying, and it will take forever with 500K. Is there a way to get Scrapebox to find all the links even if it only shows 1 million, and save them to files? Or am I stuck doing it little by little?
  • I don't think you can; I think it's going to have to be bit by bit.

    I haven't used it, but I think GScraper doesn't have this limitation, if that helps.
  • @tsaimllc, @davbel - there is a workaround for your issue. Split your mega list into smaller chunks and buy the SB Automator plugin ($20 or $25). You can set up Scrapebox to do everything on its own - scrape, dedupe, save, test proxies. Add many projects at once and that's it.
  • goonergooner SERLists.com
    @rayban - Nice tip, I didn't know about that plugin.

  • ronron SERLists.com
    Good job @rayban - I bought that SERP checker plugin last year, but forgot about the other two premium plugins.
  • For the guys asking about proxies for GScraper, I'm using red-proxie; here are my URLs per minute:

    [screenshot of URLs-per-minute stats attached in the original post]

    Of course this decreases a little bit sometimes, but still...