
How to handle big files? How to filter them?

Hi guys, I have a small problem that xrumer / scrapebox / gscraper can't handle.

I have 2 domain lists:

List 1 = 20 million domains
List 2 = 10 million domains

I want to filter out of list 2 every domain that also exists in list 1.
These lists are growing every day and I don't know how to handle even bigger lists in the future. Can someone help me?
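
For illustration, here is a minimal sketch of the filtering itself in Python, assuming both lists are plain text files with one domain per line (the file names list1.txt and list2.txt are placeholders): load list 1 into a set and stream list 2 past it, keeping only domains that are not in the set. Note that holding roughly 20 million strings in a Python set can take a few GB of RAM.

    def load_domains(path):
        # Read a file and return a set of domain strings, stripped and lowercased.
        with open(path, encoding="utf-8", errors="ignore") as f:
            return {line.strip().lower() for line in f if line.strip()}

    # List 1 is the "remove these" list; it is held entirely in memory.
    blacklist = load_domains("list1.txt")

    # List 2 is streamed line by line, so only list 1 has to fit in RAM.
    with open("list2.txt", encoding="utf-8", errors="ignore") as src, \
         open("list2_filtered.txt", "w", encoding="utf-8") as out:
        for line in src:
            domain = line.strip().lower()
            if domain and domain not in blacklist:
                out.write(domain + "\n")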

Comments

  • s4nt0s Houston, Texas
    edited May 2014
    The free SB dupe remove/file splitter doesn't work for you? http://www.scrapebox.com/free-dupe-remove
  • edited May 2014
    I told you, xrumer / scrapebox / gscraper can't handle them.

    And this addon doesn't have that option. I have to filter domains, not remove dupes.
    And no, I can't combine them into one file, remove dupes and split them back into two files.

    I need something like that:

    [attached screenshot showing the desired filter option]


    Xrumer can handle it, but it takes a lot of time to load the files. I think maybe there's something faster.
  • s4nt0s Houston, Texas
    Ya I saw you mentioned SB but didn't know if you tried the free tool. I'm not sure :/
  • edited May 2014
    This addon is already in Scrapebox :) But thank you
  • http://sourceforge.net/projects/textwedge/

    Try this, not sure if it can handle 20 million but it's worked for me with around 10 million before.


  • It's not working: "out of memory" error. Any other ideas?
  • How big are the files? 5GB or what?
  • Domains, it's about domains: 25 million domains in list A that I have to remove from list B (5 million domains there).
  • Yeah, I understand that, but how big is the file? The reason you can't open it in scrapebox or anything else is that the file contains too much data. There are other specialist tools that can do the same thing, but I need to know how big the file is in MB or GB before I can recommend one to you.

    I've worked with 8GB text files before (error logs), splitting, opening, editing and removing duplicate data from them, so there will be something.
  • Kaine thebestindexer.com
    edited June 2014
    25 million is not huge... Also, tell us how much free RAM you have.
  • @Kaine spot on. This completely depends on RAM and not on the tools. Gscraper can handle 25 million without breaking a sweat.
  • There are many programs that split a big txt file into smaller ones; for example, one file with 10 million URLs can be split into 20 files of 500,000 each. Just google it.
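
    As a rough sketch of that splitting idea in Python (the 500,000 lines per part matches the example above; the file name list_b.txt is a placeholder):

        def split_file(path, lines_per_part=500_000):
            # Write the input file out again as numbered parts of at most
            # lines_per_part lines each: list_b.txt.part1, list_b.txt.part2, ...
            part, count, out = 0, lines_per_part, None
            with open(path, encoding="utf-8", errors="ignore") as src:
                for line in src:
                    if count == lines_per_part:
                        if out:
                            out.close()
                        part += 1
                        out = open(f"{path}.part{part}", "w", encoding="utf-8")
                        count = 0
                    out.write(line)
                    count += 1
            if out:
                out.close()

        split_file("list_b.txt")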
  • Sure, it is all about the RAM, since the data is stored there while the file opens, which causes the program or your PC to crash.

    However, there are some programs that do not use just RAM to store the temporary data, so even with a little RAM you can still open it.
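
    One way such low-memory tools can work, sketched in Python: if both lists are sorted first (for example with an external sort), the difference can be computed in a single streaming pass that only ever holds one line of each file in memory. The file names are placeholders, and the sketch assumes one domain per line with both files already sorted the same way.

        def sorted_difference(list_b_path, list_a_path, out_path):
            # Write every domain of list B that does not appear in list A.
            # Both input files must be sorted in ascending order.
            with open(list_b_path) as b, open(list_a_path) as a, open(out_path, "w") as out:
                a_line = a.readline()
                for b_raw in b:
                    b_dom = b_raw.rstrip("\n")
                    # Advance list A until it catches up with the current list B entry.
                    while a_line and a_line.rstrip("\n") < b_dom:
                        a_line = a.readline()
                    # readline() returns "" only at end of file.
                    if not a_line or a_line.rstrip("\n") != b_dom:
                        out.write(b_dom + "\n")

        sorted_difference("list_b_sorted.txt", "list_a_sorted.txt", "list_b_filtered.txt")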
  • edited June 2014
    RAM is not a problem, file size is not a problem, it's just about finding the right tool :)
    You have to understand the task that is given.

    Delete all domains that are in list A from list B.
    This is not "remove duplicate domains".
    Yes, gscraper can open these files, but there is no such option in it.

    Yes, this amount of domains is not huge, but for this task it is.
    They are also growing all the time.

    Xrumer has such an option, but after 10 hours the task was only 3% complete. Xrumer used only 600 MB of RAM, so stop telling me about RAM.
