Sequential list de-duping

team74 · May 2013

Hi, I'll start by saying thanks for the awesome software, I've got multiple licenses and I've made a lot of money with GSA so cheers!

I've mentioned this before, but the de-duping URLs function hangs, which means I have to go through it selecting checkboxes one at a time. I've spent many hours doing this until now.

I also use notepad++ to dedupe some of the largest lists just to ease the load so it doesn't make GSA hang or crash.

This is probably because I have 40 million+ identified URLs, but in any case GSA SER seems to try to open all the selected site lists at once. The hard-drive read/write speed hits its maximum and then the server just seems to stop; the processes are no longer in the list and it has clearly failed.

@Sven, is there any way you can make the program go through each site list one by one, in sequential order, so the disk isn't a bottleneck?

Or at least make it easier to select groups of lists, rather than one at a time, please!

doubleup · May 2013

Bit weird that you have that issue, as I also have a large identified folder (roughly 5GB with duplicate URLs removed), and it doesn't really take that long to de-dupe. I do it a similar way to you as well, except I use Scrapebox to dedupe my general blog comment file, as it's over 2GB alone, and I leave GSA SER to de-dupe the rest. Scrapebox takes a few minutes to de-dupe the blog file, while GSA SER probably takes about 10 minutes or so to dedupe the rest.

team74 · May 2013

You dedupe all of them except the general blog file...at the same time?!?!

I've tried on 2 dedis, both high spec (8 cores at 3.3GHz+), but you can see in the resource monitor that the bottleneck is the read/write speed of the hard drive.

yavuz · May 2013

Why do you dedupe them?

doubleup · May 2013

@team74 I move the general blog file to another folder, then de-dupe that with Scrapebox. Once that is complete, I de-dupe all the files within the 'identified' folder via GSA SER. Once that's complete, I move the general blog file back into the 'identified' folder again, and I'm good to go.

@yavuz You do it to get rid of the duplicate URLs within your site list, otherwise you'll be wasting time trying to submit to links you've already submitted to, etc.

team74 · May 2013

@doubleup, thanks for that answer, it's very insightful because I didn't realize you could de-dupe by list type (Identified, Success, Verified, Failed).

I've had another look, and I can't see where you do that. Can you point that out to me please?

Also, I am familiar with duperemove but it's no better than notepad++ (except for larger files) because you still have to do them one at a time.

doubleup · May 2013

@team74 My mistake, I didn’t mean just the ‘identified’ folder, but all folders. Thinking about it further, do you have ‘save identified sites to’ ticked within the advanced section of options? If so, every single link GSA SER comes across gets written to its relevant file within the folder, and your list will be massive with a lot of duplicates. I only have submitted and verified ticked. I used to save identified as well, but that used to produce the issue you mention, so maybe that has something to do with how long it’s taking you to dedupe.

You’re right about Scrapebox’s dedupe, but as I mentioned, I only use it on one file, as it’s very large. For large files (300MB+, for example) I’d use a tool such as Scrapebox’s duplicate remover addon, but anything smaller than that I’d just leave to GSA SER, as in my experience it’ll fly through the lot pretty quickly.

team74 · May 2013

Ok, that makes sense. As it turns out, my dedi did manage to crunch through everything during the night, even if it did take 8 hours!

Yes, I save all URLs, even the failed ones.

I like to keep the identifieds too so I can put them through XRumer.

5GB of success and verifieds eh? Pretty impressive. Cheers.

Sven · May 2013

Why do you click each item to check/uncheck it? You have a popup menu that does this for you. Though it should never crash or hang on the dedupe. Can you provide your huge lists somehow so I can reproduce this?

AlexR · May 2013

I have been thinking that surely it makes sense that if SER finds a moment where it has some free resources, it should dedupe lists as a background task?

Or at least give us the option to dedupe these things on a schedule, like every 3 days, run a quick dedupe. This way, it keeps it ordered.

I think Sven is also finding a solution to all the dead links in the lists, so that they will be neater, so that should also help this.

@sven - I've been wondering about reducing the size of these lists. A verified link is stored in all three lists, so we have a triplication here. What if we had an option to neaten lists, so that if a link is in the verified list, it is then removed from the submitted and the identified lists? This way, you can curate lists for different link levels. This should reduce the number of links by at least 1/3, but maybe closer to half. If a user wanted to use a mix of links for projects, they could then select "Verified" & "Submitted" and it would be the same as if you currently selected just "Submitted". Surely this would reduce list sizes?
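The pruning idea can be tried offline, outside SER. This is only a minimal sketch, assuming plain-text lists with one URL per line and a verified list small enough to hold in memory; `prune_against` and the file names in the usage note are made up for illustration and are not part of GSA SER.

```python
from pathlib import Path


def prune_against(verified_path, other_path, out_path):
    """Copy other_path to out_path, skipping any URL found in verified_path.

    Hypothetical helper: the whole verified list is held in a set,
    while the other list is streamed line by line.
    """
    with Path(verified_path).open(encoding="utf-8", errors="replace") as f:
        verified = {line.strip() for line in f if line.strip()}

    kept = 0
    with Path(other_path).open(encoding="utf-8", errors="replace") as fin, \
         Path(out_path).open("w", encoding="utf-8") as fout:
        for line in fin:
            url = line.strip()
            if url and url not in verified:
                fout.write(url + "\n")
                kept += 1
    return kept
```

Something like `prune_against("verified.txt", "submitted.txt", "submitted_pruned.txt")` would leave only the submitted URLs that never made it to verified. The set of verified URLs is the memory cost of doing this.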

Sven · May 2013

@AlexR

There is no improvement in dedupe. This is just for the people who like things being organized. It is not improving speed or submission.

Removing one URL from a list when it is added to another would mean that the program must keep the list in memory...a huge waste of resources...so no ... never.

Background task: Sorry, also a no. It would again mean loading too much data into memory, and you never know how much memory a thread will need to carry on.

m3ownz · May 2013

@team74 @doubleup You may find the following useful:

https://dl.dropboxusercontent.com/u/14851340/SimpleDeDupe.rar

It's a simple app I wrote to dedupe lists fast. It handles about 1 million lines a second on my i5.

Deduped lists are saved as "originalfilename_deduped.txt" in whatever folder(s) the original file was based.
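
For anyone who would rather script it, the same behaviour can be roughly sketched in Python. This is not SimpleDeDupe's code, just an illustration that assumes one URL per line; `dedupe_file` and `dedupe_sequentially` are made-up names. Each file is finished before the next one is opened, so only one list is in memory at a time.

```python
from pathlib import Path


def dedupe_file(src, suffix="_deduped"):
    """Write the unique lines of src, in first-seen order, next to the original.

    "list.txt" becomes "list_deduped.txt" in the same folder.
    """
    src = Path(src)
    dst = src.with_name(f"{src.stem}{suffix}{src.suffix}")
    seen = set()
    with src.open("r", encoding="utf-8", errors="replace") as fin, \
         dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            url = line.strip()
            if url and url not in seen:  # skip blanks and repeats
                seen.add(url)
                fout.write(url + "\n")
    return dst


def dedupe_sequentially(paths):
    """Process the files one after another, never several at once."""
    return [dedupe_file(p) for p in paths]
```

Usage would be along the lines of `dedupe_sequentially(Path("identified").glob("*.txt"))`, where the folder name is just an example.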

@AlexR I'll take a look at your project idea tomorrow.

ron · May 2013

@m3ownz - Thanks for sharing! I can't wait to give that a whirl.

Ozz · May 2013

sounds awesome. thanks a lot.

m3ownz · May 2013

No problem, hopefully it's useful. The GUI might be a little rough as it was originally written as part of a larger console app.

If it's not obvious, you can hold shift to select multiple files, and you can keep pressing the add button to add more files from different folders. The tool will then go through each in order.

doubleup · May 2013

@m3ownz Cheers, will take a look

team74 · May 2013

@m3ownz nice one mate! That works really well.

It's a shame you can't save them with the same file name, but the time you've saved me has been spent on making a ubot to rename the files.

*round of applause*

team74 · May 2013

Oh man this is awesome, it took 150 seconds to clear 22 million dupes from my identified lists!

BTW I put the rar and the exe through virus total and both come back with 0 results (clean).

m3ownz · May 2013

@team74 added overwrite option for you:

https://dl.dropboxusercontent.com/u/14851340/SimpleDeDupe.rar

Be careful not to accidentally remove duplicate domains on a blog comment list or similar with this option checked. Might be an idea to create backups first.
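
The warning comes down to the difference between deduping by URL and deduping by domain. A small sketch of the two, assuming every URL carries a scheme such as `http://` (without one, `urlsplit` returns an empty host); the function names are made up for illustration:

```python
from urllib.parse import urlsplit


def unique_urls(urls):
    """Keep the first occurrence of each exact URL."""
    seen, out = set(), []
    for url in urls:
        if url not in seen:
            seen.add(url)
            out.append(url)
    return out


def unique_domains(urls):
    """Keep only the first URL seen for each host name."""
    seen, out = set(), []
    for url in urls:
        host = urlsplit(url).netloc.lower()
        if host not in seen:
            seen.add(host)
            out.append(url)
    return out
```

On a blog comment list, many different post URLs share one domain, so a domain-level dedupe throws away targets that a URL-level dedupe would keep.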

team74 · May 2013

Sweeeeeeeet.

Yeah, I'm proper paranoid and cautious about my link lists. I have copies everywhere, even on my own cloud and stuff, but thanks for the warning.

m3ownz · May 2013

Yeah, thought I'd just mention it just in case!

Glad it worked out well for you, it's certainly better than using notepad++!

Worth noting that it reads a complete file into memory, so if you have a particularly massive file (e.g. larger than your system RAM) it will fail, but I do not realistically see that happening.
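
If a file ever did outgrow RAM, one common workaround is to split it into smaller bucket files by hash and dedupe each bucket on its own. This is a sketch of that general technique, not something SimpleDeDupe does; `dedupe_large` and the bucket count are assumptions, and the output order is not preserved.

```python
import tempfile
import zlib
from pathlib import Path


def dedupe_large(src, dst, buckets=64):
    """Dedupe src into dst while holding only one bucket in memory.

    Identical lines always hash to the same bucket, so deduping
    each bucket separately removes every duplicate in the file.
    """
    src, dst = Path(src), Path(dst)
    with tempfile.TemporaryDirectory() as tmp:
        paths = [Path(tmp) / f"bucket_{i}.txt" for i in range(buckets)]
        files = [p.open("w", encoding="utf-8") for p in paths]
        try:
            # Pass 1: scatter lines into bucket files by hash.
            with src.open("r", encoding="utf-8", errors="replace") as fin:
                for line in fin:
                    url = line.strip()
                    if url:
                        idx = zlib.crc32(url.encode("utf-8")) % buckets
                        files[idx].write(url + "\n")
        finally:
            for f in files:
                f.close()

        # Pass 2: dedupe one bucket at a time and append to the output.
        total = 0
        with dst.open("w", encoding="utf-8") as fout:
            for p in paths:
                with p.open("r", encoding="utf-8") as fin:
                    unique = {line.rstrip("\n") for line in fin}
                fout.writelines(u + "\n" for u in unique)
                total += len(unique)
    return total
```

The price is a second copy of the data written to the temp folder, so disk speed rather than RAM becomes the limit.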
