How to import 12 million URLs without freezing?
Before, I tried to import a list of 3 million URLs and SER froze. I am not sure if it really froze: I started the import and logged off my VPS, came back 24 hours later, and it was still importing. I don't know if it froze or what happened, but I had to force it to close with Ctrl+Alt+Delete.
Now I just scraped a list of 12 million URLs and I am scared of importing it. Is there anything I can do to avoid problems?
Comments
An SSD is a lot faster. SATA drives can consume a lot of processor resources during reads and writes, and it may take ages to import a list that size, especially on a VPS.
Chunking your imports seems to help. Try doing 4-5 million at a time.
How do I chunk it up? Scrapebox can't handle a task like that.
http://www.softpedia.com/get/System/File-Management/Text-File-Splitter.shtml
When you chunk it, try importing smaller chunks at a time - not all at once.
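If you don't want to rely on a separate splitter tool, a few lines of Python will split the file without ever loading it all into memory. This is just a sketch; the file name and chunk size are placeholders you'd adjust to your setup:

# split a huge URL list into ~4 million-line pieces without
# loading the whole file into RAM (reads one line at a time)
CHUNK_SIZE = 4_000_000   # lines per output file

part = 0
out = None
with open("urls.txt", "r", encoding="utf-8", errors="ignore") as src:
    for i, line in enumerate(src):
        if i % CHUNK_SIZE == 0:        # time to start a new chunk file
            if out:
                out.close()
            part += 1
            out = open("urls_part%d.txt" % part, "w", encoding="utf-8")
        out.write(line)
if out:
    out.close()

Each urls_partN.txt file can then be imported as its own smaller job.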
I'm betting your issue is the 1 GB of RAM.
That would rule it out in this instance. I have found several programs for dupe removal, including some on this forum, but all seem limited by the system RAM.
I am currently 7-zipping the scrapes, downloading them, and processing them offline on my desktop. This is a real pain, and I am sure there are better ways.
I am testing to see if GScraper's speed degrades drastically with the dedupe function on. I suspect it does on large or long scrapes, but we will see. If anyone has any suggestions, I'd like to hear them.
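One of those "better ways" around the RAM ceiling might be a two-pass, on-disk dedupe: hash each URL into one of N bucket files, then dedupe each bucket on its own, so the in-memory set only ever holds one bucket's worth of lines. A rough Python sketch, where the file names and bucket count are just placeholders:

import os
import zlib

SRC = "urls.txt"   # the scraped list (example name)
BUCKETS = 64       # more buckets = less RAM needed per pass

# Pass 1: distribute lines across bucket files by hash.
# Duplicate URLs always land in the same bucket.
outs = [open("bucket_%d.txt" % i, "w", encoding="utf-8") for i in range(BUCKETS)]
with open(SRC, "r", encoding="utf-8", errors="ignore") as src:
    for line in src:
        url = line.strip()
        if url:
            outs[zlib.crc32(url.encode()) % BUCKETS].write(url + "\n")
for f in outs:
    f.close()

# Pass 2: dedupe each bucket independently; only one bucket's
# URLs are ever held in memory at a time.
with open("urls_deduped.txt", "w", encoding="utf-8") as dst:
    for i in range(BUCKETS):
        name = "bucket_%d.txt" % i
        seen = set()
        with open(name, "r", encoding="utf-8") as f:
            for line in f:
                if line not in seen:
                    seen.add(line)
                    dst.write(line)
        os.remove(name)   # clean up the temporary bucket file

Peak memory is then roughly the size of the largest bucket rather than the whole list, so a 12 million URL file should be workable even on a 1 GB VPS.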