
how to import 12 million urls without freezing?

Before, I tried to import a list of 3 million URLs and SER froze. I am not sure if it really froze, but I started the import and logged off my VPS; when I came back 24 hours later it was still importing. I don't know if it froze or what happened, and I had to force it to close with Ctrl+Alt+Delete.

Now I have just scraped a list of 12 million URLs and I am scared of importing it. Is there anything I can do to avoid problems?

Comments

  • bestimtoolz High PR WEB 2.0 posting service - affordable!
    Do you have a SATA or SSD VPS?
    SSD is a lot faster; SATA drives can eat a lot of processor resources during reads and writes, and it can take ages to import a list that size, especially on a VPS.
  • 1linklist FREE TRIAL Linklists - VPM of 150+ - http://1linklist.com
    What are the specs on your VPS? I have some seriously cracked out machines, and even for me importing 12 million URLs takes a bit. Just the other day I imported something like 81 million, and it took a solid hour to go through.

    Chunking your imports up seems to help. Try doing 4-5 million at a time.
  • This is a SATA VPS. I am not sure of the specs, but it's not very good - I think 1 GB of RAM and 60 GB of disk.
    How do I chunk it up? Scrapebox can't handle a task that big.


  • 1linklist FREE TRIAL Linklists - VPM of 150+ - http://1linklist.com
    When I have to work with large textfiles, I use the following wonky little app:

    http://www.softpedia.com/get/System/File-Management/Text-File-Splitter.shtml

    When you chunk it, try importing smaller chunks at a time - not all at once.

    I'm betting your issue is the 1 GB of RAM. ;)
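
    For reference, the same split can be done with a short script that never holds the whole file in memory. This is only a sketch: the scraped_urls.txt name and the 4-million-line chunk size are placeholder choices, not anything from the thread.

        # split_urls.py - split a huge URL list into smaller files, streaming line by line.
        # The input name and CHUNK_SIZE below are placeholders; adjust to taste.

        CHUNK_SIZE = 4_000_000  # roughly the 4-5 million lines per chunk suggested above

        def split_file(path, chunk_size=CHUNK_SIZE):
            part = 0
            out = None
            with open(path, "r", encoding="utf-8", errors="ignore") as src:
                for i, line in enumerate(src):
                    if i % chunk_size == 0:      # start a new chunk file
                        if out:
                            out.close()
                        part += 1
                        out = open(f"{path}.part{part}.txt", "w", encoding="utf-8")
                    out.write(line)
            if out:
                out.close()

        if __name__ == "__main__":
            split_file("scraped_urls.txt")  # writes scraped_urls.txt.part1.txt, .part2.txt, ...

    Because it streams one line at a time, RAM usage stays flat no matter how large the source file is.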
  • Just use this and split the file into manageable chunks:

  • @RuFFCuT, it was my understanding that the current iteration of SB was limited in the number of keywords/footprints and URLs it can handle at any one time. The keywords/footprints seem to crash approaching 1 million, and the dupe remover (which is really quick) is limited by system memory.

    This would rule it out in this instance. I have found several programs for dupe removal, some on this forum, but all seem limited by the system RAM.

    I am currently 7-zipping the scrapes, downloading them, and processing them offline on the desktop. This is a real pain, and I am sure there are better ways.

    I am testing to see if GScraper's speed degrades drastically with the dedupe function on. I suspect it does on large or long scrapes, but we will see. If anyone has any suggestions, I'd like to hear them.
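
    On the RAM ceiling: one way around it is to bucket lines out to disk by hash first and then dedupe each bucket on its own, since identical lines always land in the same bucket. A rough sketch only - the file names and the 256-bucket count are arbitrary choices, not a tested recommendation:

        # dedupe_big.py - remove duplicate lines from a file too big to dedupe in RAM.
        # Pass 1 scatters lines into bucket files on disk by hash; pass 2 dedupes
        # each bucket with a set, since each bucket is only a fraction of the input.

        import os
        import zlib

        NUM_BUCKETS = 256

        def dedupe(src_path, dst_path, workdir="dedupe_tmp"):
            os.makedirs(workdir, exist_ok=True)
            buckets = [open(os.path.join(workdir, f"bucket_{i}.txt"), "w", encoding="utf-8")
                       for i in range(NUM_BUCKETS)]

            # Pass 1: identical lines always hash to the same bucket,
            # so duplicates never span two bucket files.
            with open(src_path, "r", encoding="utf-8", errors="ignore") as src:
                for line in src:
                    line = line.strip()
                    if line:
                        buckets[zlib.crc32(line.encode()) % NUM_BUCKETS].write(line + "\n")
            for b in buckets:
                b.close()

            # Pass 2: dedupe each bucket independently and append to the output.
            with open(dst_path, "w", encoding="utf-8") as dst:
                for i in range(NUM_BUCKETS):
                    bucket_path = os.path.join(workdir, f"bucket_{i}.txt")
                    seen = set()
                    with open(bucket_path, "r", encoding="utf-8") as b:
                        for line in b:
                            if line not in seen:
                                seen.add(line)
                                dst.write(line)
                    os.remove(bucket_path)

        if __name__ == "__main__":
            dedupe("scraped_urls.txt", "scraped_urls_deduped.txt")

    Each bucket holds roughly 1/256th of the input, so the in-memory set stays small even for a list of 100 million lines or more.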
  • RuFFCuT UK
    edited October 2014
    @icarusVN The Scrapebox tool I mentioned isn't limited the way Scrapebox itself is - it is a separate tool. Read the first paragraph on the page I sent you:

    ScrapeBox DupeRemove is a small, fast, lightweight and free tool that allows you to merge multiple text file URL lists in to one large file. Also it can remove duplicate URL’s and duplicate domains from files as large as 180 Million lines long in just a few seconds.

    So that should do exactly what you want - split the files and remove dupes. RAM isn't an issue with it either :) 
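
    For anyone who would rather script the domain-level dedupe, it comes down to keying on the hostname instead of the full URL. A minimal sketch - the file names are placeholders, and it assumes each URL carries its http/https scheme:

        # dedupe_domains.py - keep only the first URL seen per domain.
        # File names are placeholders. Only unique hostnames are held in memory,
        # which is a much smaller set than the full URL list.

        from urllib.parse import urlparse

        def dedupe_by_domain(src_path, dst_path):
            seen_domains = set()
            with open(src_path, "r", encoding="utf-8", errors="ignore") as src, \
                 open(dst_path, "w", encoding="utf-8") as dst:
                for line in src:
                    url = line.strip()
                    if not url:
                        continue
                    # urlparse only fills netloc when a scheme is present,
                    # so scheme-less lines are skipped by this sketch.
                    domain = urlparse(url).netloc.lower()
                    if domain and domain not in seen_domains:
                        seen_domains.add(domain)
                        dst.write(url + "\n")

        if __name__ == "__main__":
            dedupe_by_domain("merged_list.txt", "one_url_per_domain.txt")

    The first URL seen for each domain is the one that gets kept; swap in whatever tie-break you prefer.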
  • Great stuff @RuFFCuT‌ . Assumptions are the mother of all ... Thanks
  • ron SERLists.com
     +1 @RuFFCuT - I never even saw that SB side product before! Freaking excellent.  ^:)^
  • 1linklist FREE TRIAL Linklists - VPM of 150+ - http://1linklist.com
    edited October 2014
    Woah, I've used that tool for years and never realized it could split files too. Talk about screen blindness.