
GScraper freezes while importing list

Hello everyone, I let my GScraper run a one-day scrape and it yielded 8M links in a 502MB text file.
Now whenever I try importing that into GScraper to remove duplicate URLs and duplicate domains, the blue progress bar moves a bit and then freezes, and GScraper goes to (Not Responding). I have stopped all other programs and am running only GScraper on my 2.5GB RAM VPS, and it still does that.

Any solution?

Comments

  • Split the list using GSplit before importing. 500 MB is a very big file.
  • I used GSplit and it produced 5 files of 100 MB each, but the extension is .GSD, which doesn't work with GScraper.
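    (If GSplit's .GSD containers are the problem, one alternative is to cut the file yourself into plain .txt parts. Below is a minimal Python sketch of that idea; the file names and chunk size are only examples, not anything GScraper or GSplit provides.)

    # split_list.py -- cut a big URL list into plain .txt parts
    CHUNK_LINES = 2_000_000                     # roughly 100 MB of ~50-byte URLs per part

    part, count = 1, 0
    out = open("links_part1.txt", "w", encoding="utf-8", errors="ignore")
    with open("scraped_links.txt", encoding="utf-8", errors="ignore") as src:
        for line in src:
            if count == CHUNK_LINES:            # current part is full, start a new one
                out.close()
                part += 1
                count = 0
                out = open(f"links_part{part}.txt", "w", encoding="utf-8", errors="ignore")
            out.write(line)
            count += 1
    out.close()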
  • Importing the list into GScraper when you only need to remove duplicates is the wrong way to do it. Here is how you do it:
    [screenshot of GScraper's duplicate-removal steps]

    500 MB is a small list. Yesterday I removed duplicates from a 10GB txt file with no problems.
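    (For reference, here is a minimal Python sketch of the same job -- dropping duplicate URLs and keeping only one URL per domain. It is not GScraper's own code, just an illustration; the sets live in memory, which is why huge lists need plenty of RAM. File names are examples.)

    # dedupe.py -- remove duplicate URLs and duplicate domains from a link list
    # assumes each line is a full http(s) URL, one per line
    from urllib.parse import urlparse

    seen_urls, seen_domains = set(), set()
    with open("scraped_links.txt", encoding="utf-8", errors="ignore") as src, \
         open("deduped_links.txt", "w", encoding="utf-8") as dst:
        for line in src:
            url = line.strip()
            if not url or url in seen_urls:
                continue                        # skip blanks and exact duplicate URLs
            domain = urlparse(url).netloc.lower()
            if domain in seen_domains:
                continue                        # already kept a URL from this domain
            seen_urls.add(url)
            seen_domains.add(domain)
            dst.write(url + "\n")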
  • edited May 2014
    Thanks @satyr85
  • @satyr85 another question please: what can I do when, in the middle of a scrape, I want to take the links GScraper has produced so far and put them into GSA while GScraper keeps scraping? I had to stop GScraper at 50% of its scrape to do that.

    Also, if I scrape again for the same keywords, will GScraper produce the same links I scraped before? I got 8M links, and after removing duplicates they were filtered down to 2M.

    Thanks


  • Let's say you take 50% of what GScraper produced, remove duplicates, save it as linklist1 and import it to GSA; then, after the scrape is finished, you take 100%, save it as linklist2 and import that to GSA too. You are importing some of the links already included in linklist1 a second time, and that is a big waste of resources.

    You could remove duplicates from linklist1 and from linklist2, put all domains from linklist1 into a blacklist, then remove all domains from linklist2 that match the blacklist (for example using Xrumer). With this method you will get the unique domains from linklist2 that are not included in linklist1, but... this method takes ages and simply isn't the way to go.
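
    (A rough Python sketch of that blacklist idea, for anyone who wants to try it without Xrumer; the file names are examples, and it still holds every linklist1 domain in memory.)

    # cross_dedupe.py -- keep only links in linklist2 whose domain is not in linklist1
    from urllib.parse import urlparse

    def domains_of(path):
        found = set()
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                url = line.strip()
                if url:
                    found.add(urlparse(url).netloc.lower())
        return found

    blacklist = domains_of("linklist1.txt")     # domains already imported into GSA

    with open("linklist2.txt", encoding="utf-8", errors="ignore") as src, \
         open("linklist2_new_domains.txt", "w", encoding="utf-8") as dst:
        for line in src:
            url = line.strip()
            if url and urlparse(url).netloc.lower() not in blacklist:
                dst.write(url + "\n")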

    The best option is to scrape as many links as you can (for a few days), then import them all to GSA, but that is also problematic.

    When you scrape for the same keywords again after, let's say, 7 days, GScraper will give you X% new links.

    P.S.
    Stop spamming the forum with the same question again and again.
  • @satyr85 thanks for your helpful info. The reason I posted the question again is that I thought nobody would see it as a subthread of this original thread.

    Thanks again
  • @satyr85 How long did it take to remove dupes from your 10GB file? I have it running now for a 3.5GB file, and it looks like memory is pinned the same way it was when I tried to import the list.
  • bpm4gsa 
    I don't remember, but if you don't have enough RAM it will take days, or GScraper will crash. I removed duplicates from big files on 32GB and 48GB servers. How much RAM do you have on your server (and how much do you pay for it, if you don't mind sharing that here or via PM)?
  • I have 2GB RAM ... https://www.solidseovps.com/windowsvps.php ... the $35/mo package.



  • edited July 2014
    @satyr85 - memory runs up to almost 2GB then drops to ~1.6GB, then runs back up, and repeats .... sorry, image paste didn't work. Edited to remove image.
  • bpm4gsa 
    2GB is a very small amount of RAM and that can be the problem. If you add $15-25 per month you can get a nice dedicated server, perfect for running SER and GScraper at the same time.
  • KaineKaine thebestindexer.com
    edited July 2014
    Yes, check your RAM usage in Task Manager and enable the page file (swap) if it isn't already (it is usually enabled by default).

    You can cut the file into parts and dedupe on the fly, as in satyr's screenshot.

    To join the parts back together you can use cmd: go into the folder and type:

    type *.txt > nameyouwant.txt

    Then repeat the operation again and again.
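
    (One caveat with the cmd one-liner: if nameyouwant.txt also matches *.txt in the same folder, cmd may pick the output file up while merging, so it is safer to write it to another folder or give it a different extension. Below is a small Python sketch that merges the parts and also drops duplicates that appear in more than one part; file names are only examples.)

    # join_parts.py -- merge deduped parts, skipping URLs already seen in an earlier part
    import glob

    seen = set()
    with open("merged_links.txt", "w", encoding="utf-8") as dst:
        for part in sorted(glob.glob("links_part*.txt")):
            with open(part, encoding="utf-8", errors="ignore") as src:
                for line in src:
                    url = line.strip()
                    if url and url not in seen:
                        seen.add(url)
                        dst.write(url + "\n")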


  • Thank you both for the help.
  • edited July 2014
    BlacKhatPriesT
    @bpm4gsa if you are not going for a dedi, and in case you have ScrapeBox too, the dup remover addon works with 180 million lines; it is fast and needs fewer resources than GScraper's dedupe function.