
Anyone know of a bulk (multi-million) URL-level compare tool other than xr?

The URL-level compare function that ScrapeBox has compares a new list against an old scrub list so you avoid processing any of the same URLs again. I need something that can do this with two multi-million URL lists. I can only find tools that do it at the domain level; I need it at the specific URL level, because I don't want to lose the inner URLs of many domains, such as comment URLs.

thanks

Comments

  • @googlealchemist you should do this type of stuff in a UNIX shell with sed (or with the GNU Win32 tools: http://gnuwin32.sourceforge.net/packages/sed.htm). It's a powerful tool that uses regex, and its name is short for "stream editor". I would merge the files together, build a regex to match the string, and either delete the matching line or put a special character in it that you can recognize to rip it out later. Be sure to back up your URLs before messing with sed. Like I said, it is powerful (and VERY fast).
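    A minimal sketch of that idea (the file names and the pattern are only placeholders, not from this thread): delete every line in the merged list whose URL matches a regex.

    ~$ sed '\|^http://example\.com/|d' merged_urls.txt > filtered_urls.txt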
  • googlealchemist (Anywhere I want)
    thanks, I'll save this info and see if a tech-savvy outsourcer can implement it, as it is way beyond me
  • You don't need a tech-savvy outsourcer, you just need a Linux shell. I've done this before; it's easy as long as the line that holds the URL is 100% identical to the ones you want to delete. Just do this:

    ~$ sort urls.txt > sortedurls.txt
    ~$ uniq sortedurls.txt > dedupedurls.txt

    That's it. No regex or sed needed unless you want to trim the URLs in some way. You can download Cygwin and do this on a Windows box too.
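    If the goal is specifically what the original post asks for, dropping from the new list any URL that already appears in the old scrub list rather than just de-duplicating one file, a minimal sketch with comm would look like this (file names are placeholders):

    ~$ sort old_list.txt > old_sorted.txt
    ~$ sort new_list.txt > new_sorted.txt
    ~$ comm -13 old_sorted.txt new_sorted.txt > fresh_urls.txt

    comm -13 prints only the lines unique to the second (new) file, so URLs already in the old scrub list are dropped while inner URLs that were never processed are kept.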