
Dedupe Project & Tool causing Out of Memory error

I tried processing ~20 files, each 8 MB, and ran into an out of memory error using the Tool's deduplicate URLs feature.

Also, the Project Remove Duplicates stops when processing many large files. Probably the same error as above.

After having no success deduping the files, I turned to SER since the files were in sitelist format. I loaded the files into SER and deduped them successfully. It seems the deduplication algorithm here is different from SER's, and inferior to it.

Please fix

Comments

  • s4nt0s Houston, Texas
    When you say sitelist format do you mean you're loading in .SL files or just .txt files?
  • Yes, sitelist-organized text files, not the zipped *.sl files.
  • edited August 2015
    @kp55

    Currently I'm using this method to prep for dedupe via gsapi, and then I split the file into smaller pieces; I split it separately just in case. You can also skip command 1 and just use command 2, depending on your server resources.

    Note:
    You can run this on your Linux box, or download Cygwin to run it on Windows.
    This only removes duplicate URLs, not domains (a domain-level variant is sketched at the end of this comment). 14GB total takes about 15-30 mins.

    command 1
    cat /home/txt-files/* | LC_ALL=C sort | uniq > /home/output.txt

    command 2
    split -dl 50000 --additional-suffix=.txt /home/output.txt /home/split/


    I'm not getting any errors on gsapi yet; if you do, export to project or save, split up the number of files you dupe-check at once, or save to multiple files. It does a 3+ million URLs per minute dedupe on gsapi.
    This is on a 16GB dedicated Windows Server 2008 box, E3-1220 @ 3.10 GHz.
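
    In case it helps: the note above says command 1 only removes duplicate URLs, not domains. Here is a minimal domain-level sketch, assuming /home/output.txt from command 1 and full URLs that include the scheme (http:// or https://); the output path is just an example. It keeps only the first URL seen for each host.

    domain dedupe (optional)
    # -F/ splits each line on "/", so field 3 is the host; a line is printed only the first time its host appears
    awk -F/ '!seen[$3]++' /home/output.txt > /home/output-domains.txt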
  • The issue was resolved in the last update.