I'm an avid user of GSA SER and totally love the software that @Sven
has made. As you know, as your lists build up, so does all the garbage in them. With this post, I hope to tell you how to keep your lists clean as a whistle.
If you’re having low LPM problems, this may fix everything for you.
First off, I have to stress this: it takes lots of work and time to complete, so if you are one of those users who doesn't like lots of work, or doesn't mind getting all those download failed messages in SER, you can stop reading now. Thank you.
1) First thing you want to do is create a project for each engine type that SER has. (I told you this is going to be lots of work.)
The reason for this is that SER sometimes mis-identifies an engine if you have multiple engines selected that share the same footprints on a page. A great example of this is the General Blogs engine versus a few Article engines. I've had many sites get identified as General Blogs when they should have been identified as, for example, Zendesk.
Make sure that you do not put any limits on these projects, such as pausing after X verified submissions.
2) Once you have all the projects created, select them all then Right Click > Modify Project > Move To Group > New. Name this group (Sorting Project). Now all those projects will be in a nice group for you.
3) Once that's completed, open up Scrapebox and open the DupRemove addon. What you want to do now is merge all the site lists together. Merge all your Identified lists and save the result as GSA Identified Site Lists. Do this for each of your site list folders. With that same tool, remove all the duplicate URLs as well.
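If you'd rather script this than use the DupRemove addon, the merge-and-dedupe step is simple to sketch in plain Python. This is just an illustration, not part of SER or Scrapebox; the function name and sample URLs are made up, and file reading/writing is left out:

```python
def merge_and_dedupe(url_lists):
    """Merge several URL lists into one, dropping exact duplicates
    while keeping the original order of first appearance."""
    seen = set()
    merged = []
    for urls in url_lists:
        for url in urls:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Example: two overlapping lists merge into one clean list.
identified = merge_and_dedupe([
    ["http://example.com/post1", "http://example.org/"],
    ["http://example.com/post1", "http://example.net/blog"],
])
```

You would run this once per site list folder (Identified, Submitted, Verified, Failed) so each folder ends up as one deduplicated file.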
If you have huge site lists, over 1 million URLs in each of the files that you just created, simply split them up into chunks of about 600,000 URLs each.
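The splitting step is equally easy to do in Python if you prefer. A minimal sketch (the function name and chunk size parameter are my own; Scrapebox's file splitter does the same job):

```python
def split_list(urls, chunk_size=600_000):
    """Split a large list of URLs into fixed-size chunks so each
    output file stays at a manageable size."""
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]

# A 5-item list with chunk_size=2 becomes three chunks: 2 + 2 + 1.
chunks = split_list(["u1", "u2", "u3", "u4", "u5"], chunk_size=2)
```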
4) Once you have your files created, import each one into Scrapebox and remove all unwanted extensions, e.g. .gif, .pdf, .jpg, etc. Then export the cleaned URL list under the same file name you just cleaned, and Scrapebox will overwrite it with the cleaned list.
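The extension filter can also be sketched in a few lines of Python. The extension list below just mirrors the examples above plus a couple of obvious extras; adjust it to taste:

```python
# Extensions that are never postable targets (assumed list, extend as needed).
UNWANTED = (".gif", ".pdf", ".jpg", ".jpeg", ".png")

def drop_unwanted(urls):
    """Remove URLs whose path ends in an unwanted file extension.
    The query string is stripped before checking the extension."""
    return [
        u for u in urls
        if not u.lower().split("?")[0].endswith(UNWANTED)
    ]

# "x.gif" is dropped; the normal page URL survives.
cleaned = drop_unwanted(["http://example.com/x.gif", "http://example.com/page"])
```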
Taking this one step further: trim each site to its root domain, remove duplicate domains, and save that under the same file name plus "Domains Only". This helps when, for example, the URL to a specific article no longer exists; the full URL would return a 404 error, but with the domain-only entry SER should still be able to identify the site.
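Trimming to root and deduping domains is what Scrapebox's "Trim to Root" plus "Remove Duplicate Domains" does; a rough stdlib equivalent, for illustration only:

```python
from urllib.parse import urlsplit

def root_domains(urls):
    """Trim each URL to its root (scheme + host + "/") and drop
    duplicate domains, keeping first-seen order."""
    seen = set()
    roots = []
    for u in urls:
        parts = urlsplit(u)
        root = f"{parts.scheme}://{parts.netloc}/"
        if root not in seen:
            seen.add(root)
            roots.append(root)
    return roots

# Two deep links on the same site collapse to one root entry.
roots = root_domains([
    "http://example.com/blog/post-1",
    "http://example.com/blog/post-2",
])
```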
5) Once you have all your new clean lists created, delete all of GSA SER's site list files. This is very important; if you don't, it just defeats the purpose of doing all this.
6) Now import each of the files into its matching project that you created, and let SER go to work.
I do this every couple of months when I do my full clean up of SER and this has helped immensely.
I hope this is easy to understand and helps you out.
If you have any questions, simply post them, and I will try my best to help.