Sequential list de-duping
Hi, I'll start by saying thanks for the awesome software, I've got multiple licenses and I've made a lot of money with GSA so cheers!
I've mentioned this before, but the de-duping URLs function hangs, which means I have to go through it selecting the checkboxes one at a time. I've spent many hours doing this so far.
I also use Notepad++ to dedupe some of the largest lists, just to ease the load so GSA doesn't hang or crash.
This is probably because I have 40 million+ identified URLs, but in any case GSA SER seems to attempt opening all the selected site lists at once: the hard-drive read/write speeds hit their maximum, the server just seems to stop, and the process is no longer in the list and has clearly failed.
@Sven, is there any way you can make the program go through each site list one by one, in sequential order, so it's not a bottleneck?
Or at least make it easier to select groups of lists, rather than one at a time, please!
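To illustrate the sort of "one list at a time" behaviour I'm asking for, here's a rough Python sketch (the folder path and .txt file layout are just placeholders, not how GSA actually stores its site lists):

```python
import os

def dedupe_file(path):
    """Remove duplicate URLs within a single site-list file, keeping order."""
    seen = set()
    unique = []
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                unique.append(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(unique) + "\n")

def dedupe_folder(folder):
    """Process each site list one by one, so only one file is open at a time
    and the disk never has to serve dozens of huge files at once."""
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            dedupe_file(os.path.join(folder, name))
            print("deduped", name)

# Example usage; the path is a placeholder for wherever your lists live.
dedupe_folder(r"C:\GSA\site_lists\identified")
```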
Comments
I've tried on 2 dedis, both high spec (8 cores at 3.3GHz+), but you can see in the resource monitor that the bottleneck is the read/write speed to the hard drive.
I've had another look, and I can't see where you do that, can you point that out to me please?
Also, I am familiar with duperemove but it's no better than notepad++ (except for larger files) because you still have to do them one at a time.
@team74 My mistake, I didn't mean just the 'identified' folder, but all folders. Thinking about it further, do you have 'save identified sites to' ticked within the advanced section of options? Obviously, every single link GSA SER comes across, it'll write to its relevant file within the folder, and your list will be massive with a lot of duplicates. I only have submitted and verified ticked. I used to also save identified, but that used to produce the issue you mention, so maybe that has something to do with how long it's taking you to dedupe.
You're right about Scrapebox's dedupe, but as I mentioned, I only use it on one file, as it's very large. When it comes to the large files (300MB+ for example), I'd use a tool such as Scrapebox's duplicate remover addon, but anything less than that I'd just let GSA SER do, as in my experience it flies through the lot pretty quickly.
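For the really big files, a hash-based pass keeps memory down because you only hold a small digest per URL instead of the full line. This is just a sketch of the idea, not what Scrapebox or GSA actually does; the file names are placeholders:

```python
import hashlib

def dedupe_large_file(src, dst):
    """Dedupe a very large URL list while storing only a 16-byte MD5 digest
    per line, instead of the full URL, to keep memory usage down."""
    seen = set()
    with open(src, "r", encoding="utf-8", errors="ignore") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            url = line.strip()
            if not url:
                continue
            digest = hashlib.md5(url.encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                fout.write(url + "\n")

dedupe_large_file("identified_big.txt", "identified_big_deduped.txt")
```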
Yes, I save all URLs, even the failed ones.
I like to keep the identifieds too so I can put them through XRumer.
5GB of success and verifieds eh? Pretty impressive. Cheers.
@AlexR
There is no improvement from deduping; it's just for people who like things being organized. It does not improve speed or submissions.
Removing a URL from one list when it is added to another would mean the program has to keep the whole list in memory... a huge waste of resources... so no, never.
Background task: sorry, also a no. It would again mean loading too much data into memory, and you never know how much memory a thread will need to carry on.
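As a rough back-of-the-envelope sketch of why that memory cost adds up with a 40-million-URL list (the per-URL sizes below are assumptions, not measurements):

```python
# Rough estimate of holding 40 million URLs resident in memory.
urls = 40_000_000
avg_url_bytes = 80          # assumed average URL length
per_entry_overhead = 60     # assumed per-string and hash-set overhead
total_gb = urls * (avg_url_bytes + per_entry_overhead) / 1024**3
print(f"~{total_gb:.1f} GB just to keep the list in memory")  # roughly 5 GB
```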
It's a shame you can't save them with the same file name, but the time you've saved me has been spent on making a ubot to rename the files.
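For anyone doing the same rename step without uBot, here's a rough Python sketch; the folder path and the "_deduped" suffix are just assumptions about how your dedupe tool names its output:

```python
import os

folder = r"C:\GSA\site_lists\identified"  # placeholder path

# Replace each original list with its deduped copy, keeping the original name.
for name in os.listdir(folder):
    if name.endswith("_deduped.txt"):
        deduped = os.path.join(folder, name)
        original = os.path.join(folder, name.replace("_deduped", ""))
        if os.path.exists(original):
            os.remove(original)
        os.rename(deduped, original)
```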
*round of applause*
BTW, I put the .rar and the .exe through VirusTotal and both came back with 0 results (clean).
Yeah, I'm proper paranoid and cautious about my link lists; I have copies everywhere, even my own cloud and stuff, but thanks for the warning.