
anyone know of a bulk (multimillion) URL level compare function tool other than xr?

The URL-level compare function that ScrapeBox has, to compare a new list against an old scrubbed list to avoid processing any of the same URLs again. I need something that can do this with two multi-million-URL lists. I can only find tools that do it at the domain level; I need something that works at the specific URL level, as I don't want to lose the inner URLs of many domains, like comment pages.



  • @googlealchemist you should do this type of stuff in a UNIX shell with sed (or with the GNU Win32 tools). sed is a powerful tool that uses regex; it is short for "stream editor". I would merge the files together, build a regex to match the string, and either delete the line or put a special character in the line that you can recognize to rip it out later. Be sure to back up your URLs before messing with sed. Like I said, it is powerful (and VERY fast).
  • googlealchemist
    thanks, I'll save this info to see if a tech-savvy outsourcer can implement it, as it is way beyond me
  • You don't need a tech-savvy outsourcer, you just need a Linux shell. I've done this before; it's easy as long as the line that holds the URL is 100% identical to the ones you want to delete. Just do this:

    ~$ sort urls.txt > sortedurls.txt
    ~$ uniq sortedurls.txt > dedupedurls.txt

    That's it. No regex or sed needed unless you want to trim the URLs in some way. You can download Cygwin and do this on a Windows box too.
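  • The sed approach mentioned above can be sketched like this. This is only a minimal example with placeholder filenames and a placeholder URL pattern, not anything from the original posts; sed's `/pattern/d` command deletes every line matching the pattern, and dots and slashes in the URL must be escaped:

        # merged.txt is an assumed file containing both URL lists combined.
        # Delete every line matching the (escaped) URL pattern and write the rest out.
        sed '/example\.com\/page/d' merged.txt > cleaned.txt

    For multimillion-line files this streams line by line, so memory use stays flat, but building one regex per URL to remove does not scale; for exact-line removal the list-compare approach below is simpler.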
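  • Note that sort + uniq dedupes a single file, while the original question was removing an old list's URLs from a new list. A minimal sketch of that two-list compare with standard tools, assuming placeholder filenames old.txt (already-processed URLs) and new.txt (incoming URLs), one URL per line:

        # comm needs sorted input, so sort both lists first
        sort old.txt > old.sorted.txt
        sort new.txt > new.sorted.txt
        # comm -13 suppresses lines unique to the first file and lines common
        # to both, leaving only URLs in new.txt that are NOT in old.txt
        comm -13 old.sorted.txt new.sorted.txt > fresh.txt

    An alternative that skips the sorting step is `grep -Fxvf old.txt new.txt` (fixed strings, whole-line match, inverted), but it loads old.txt into memory, so on multimillion-URL lists the sort + comm route is usually the safer bet.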