
GScraper freezes while importing list

Hello everyone, I let my GScraper run a one-day scrape and it yielded 8M links in a 502MB text file.
Now whenever I try importing that into GScraper to remove duplicate URLs and duplicate domains, the blue progress bar moves a bit and then freezes, and GScraper goes to (Not Responding). I have stopped all other programs and am running only GScraper on my 2.5GB RAM VPS, and it still does that.

Any solution?

Comments

  • Split the list using GSplit before importing. 500 MB is a very big file.
  • I used GSplit and it produced 5 files of 100 MB each, but the extension is .GSD, which doesn't work with GScraper.
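    (If GSplit's .GSD containers are the problem, one alternative is to cut the file yourself into plain .txt parts. Below is a minimal Python sketch of that idea; the file names and chunk size are only examples, not anything GScraper or GSplit provides.)

    # split_list.py -- cut a big URL list into plain .txt parts
    CHUNK_LINES = 2_000_000                     # roughly 100 MB of ~50-byte URLs per part

    part, count = 1, 0
    out = open("links_part1.txt", "w", encoding="utf-8", errors="ignore")
    with open("scraped_links.txt", encoding="utf-8", errors="ignore") as src:
        for line in src:
            if count == CHUNK_LINES:            # current part is full, start a new one
                out.close()
                part += 1
                count = 0
                out = open(f"links_part{part}.txt", "w", encoding="utf-8", errors="ignore")
            out.write(line)
            count += 1
    out.close()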
  • Importing the list into GScraper when you only need to remove duplicates is the wrong way to do it. Here is how you do it:
    [screenshot of GScraper's duplicate-removal steps]

    500 MB is a small list. Yesterday I removed duplicates from a 10GB txt file with no problems.
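    (For reference, here is a minimal Python sketch of the same job -- dropping duplicate URLs and keeping only one URL per domain. It is not GScraper's own code, just an illustration; the sets live in memory, which is why huge lists need plenty of RAM. File names are examples.)

    # dedupe.py -- remove duplicate URLs and duplicate domains from a link list
    # assumes each line is a full http(s) URL, one per line
    from urllib.parse import urlparse

    seen_urls, seen_domains = set(), set()
    with open("scraped_links.txt", encoding="utf-8", errors="ignore") as src, \
         open("deduped_links.txt", "w", encoding="utf-8") as dst:
        for line in src:
            url = line.strip()
            if not url or url in seen_urls:
                continue                        # skip blanks and exact duplicate URLs
            domain = urlparse(url).netloc.lower()
            if domain in seen_domains:
                continue                        # already kept a URL from this domain
            seen_urls.add(url)
            seen_domains.add(domain)
            dst.write(url + "\n")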
  • edited May 2014
    Thanks @satyr85
  • @satyr85 another question please: what can I do when, in the middle of a scrape, I want to take the links GScraper has produced so far and put them into GSA while GScraper keeps scraping? I had to stop GScraper at 50% of its scrape to do that.

    Also, if I scrape again for the same keywords, will GScraper produce the same links I scraped before? I got 8M links, and after removing duplicates they were filtered down to 2M.

    Thanks


  • Let's say you take 50% of what GScraper produced, remove duplicates, save it as linklist1 and import it to GSA; then, after the scrape is finished, you take 100%, save it as linklist2 and import that to GSA too. You are importing some of the links already included in linklist1 a second time, and that is a big waste of resources.

    You could remove duplicates from linklist1 and from linklist2, put all domains from linklist1 into a blacklist, then remove all domains from linklist2 that match the blacklist (for example using Xrumer). With this method you will get the unique domains from linklist2 that are not included in linklist1, but... this method takes ages and simply isn't the way to go.
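
    (A rough Python sketch of that blacklist idea, for anyone who wants to try it without Xrumer; the file names are examples, and it still holds every linklist1 domain in memory.)

    # cross_dedupe.py -- keep only links in linklist2 whose domain is not in linklist1
    from urllib.parse import urlparse

    def domains_of(path):
        found = set()
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                url = line.strip()
                if url:
                    found.add(urlparse(url).netloc.lower())
        return found

    blacklist = domains_of("linklist1.txt")     # domains already imported into GSA

    with open("linklist2.txt", encoding="utf-8", errors="ignore") as src, \
         open("linklist2_new_domains.txt", "w", encoding="utf-8") as dst:
        for line in src:
            url = line.strip()
            if url and urlparse(url).netloc.lower() not in blacklist:
                dst.write(url + "\n")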

    The best option is to scrape as many links as you can (for a few days), then import them all to GSA, but that is also problematic.

    When you scrape for the same keywords again after, let's say, 7 days, GScraper will give you X% new links.

    P.S.
    Stop spamming the forum with the same question again and again.
  • @satyr85 thanks for your helpful info. The reason I posted the question again is that I thought nobody would see it as a subthread of this original thread.

    Thanks again
  • @satyr85 How long did it take to remove dupes from your 10GB file? I have it running now for a 3.5GB file, and it looks like memory is pinned the same way it was when I tried to import the list.
  • bpm4gsa 
    I don't remember, but if you don't have enough RAM it will take days, or GScraper will crash. I removed duplicates from big files on 32GB and 48GB servers. How much RAM do you have on your server (and how much do you pay for it, if you don't mind sharing that here or via PM)?
  • I have 2GB RAM ... https://www.solidseovps.com/windowsvps.php ... the $35/mo package.



  • edited July 2014
    @satyr85 - memory runs up to almost 2GB then drops to ~1.6GB, then runs back up, and repeats .... sorry, image paste didn't work. Edited to remove image.
  • bpm4gsa 
    2GB is a very small amount of RAM and that can be the problem. If you add $15-25 per month you can get a nice dedicated server, perfect for running SER and GScraper at the same time.
  • KaineKaine thebestindexer.com
    edited July 2014
    Yes, check your RAM usage in Task Manager and enable the page file (swap) if it isn't already (it is usually enabled by default).

    You can cut the file into parts and dedupe on the fly, as in satyr's screenshot.

    To join the parts back together you can use cmd: go into the folder and type:

    type *.txt > nameyouwant.txt

    Then repeat the operation again and again.
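
    (One caveat with the cmd one-liner: if nameyouwant.txt also matches *.txt in the same folder, cmd may pick the output file up while merging, so it is safer to write it to another folder or give it a different extension. Below is a small Python sketch that merges the parts and also drops duplicates that appear in more than one part; file names are only examples.)

    # join_parts.py -- merge deduped parts, skipping URLs already seen in an earlier part
    import glob

    seen = set()
    with open("merged_links.txt", "w", encoding="utf-8") as dst:
        for part in sorted(glob.glob("links_part*.txt")):
            with open(part, encoding="utf-8", errors="ignore") as src:
                for line in src:
                    url = line.strip()
                    if url and url not in seen:
                        seen.add(url)
                        dst.write(url + "\n")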


  • Thank you both for the help.
  • edited July 2014
    BlacKhatPriesT
    @bpm4gsa if you are not going for a dedi, and in case you have ScrapeBox too, the dup remover addon works with 180 million lines; it is fast and needs fewer resources than GScraper's dedupe function.