How to Keep GSA SER Lists Clean and Increase Your LPM
Trevor_Bandura
267,647 NEW GSA SER Verified List
I'm an avid user of GSA SER and totally love the software that @Sven has made. As you know, as your lists build up, so does all the garbage in them. With this post, I hope to show you how to keep your lists as clean as a whistle.
If you’re having low LPM problems, this may fix everything for you.
First off, I have to stress this: it takes a lot of work and time to complete. So if you are one of those users who don't like lots of work, or you don't mind getting all those "download failed" messages in SER, you can stop reading now. Thank you.
1) The first thing you want to do is create a project for each engine type that SER has. (I told you this was going to be lots of work.)
The reason for this is that SER sometimes misidentifies an engine if you have multiple engines selected that share the same footprints on the page. A great example of this is the General Blogs engine and a few Article engines. I've had many sites get identified as General Blogs when they should have been identified as, for example, Zendesk.
Make sure that you do not put any limits on these projects, such as pausing after X verified links.
2) Once you have all the projects created, select them all, then Right Click > Modify Project > Move To Group > New. Name this group "Sorting Project". Now all those projects will be in one tidy group for you.
3) Once that's completed, open up Scrapebox and open the DupRemove addon. What you want to do now is merge all the site lists together. Merge all your Identified lists and save the result as "GSA Identified Site List". Do the same for each of your other site list folders. With that same tool, also remove all the duplicate URLs.
If you have huge site lists, with over 1 million URLs in any of the files you just created, simply split them into chunks of about 600,000 URLs each.
4) Once you have your files created, import each one into Scrapebox and remove all unwanted extensions, e.g. .gif, .pdf, .jpg, etc. Then export the cleaned URL list under the same file name, and Scrapebox will overwrite the original with the cleaned list.
To take this one step further, trim each URL to its root domain, remove duplicate domains, and save the result under the same file name plus "Domains Only". This helps when a specific URL (an article page, for example) no longer exists and would give you a 404 error; with the domain only, SER should still be able to identify the site. (A rough scripted version of steps 3 and 4 is sketched below, after step 6.)
5) Once you have all your new clean lists created, DELETE all of GSA SER's site list files. This is very important; if you don't, it defeats the purpose of doing all this.
6) Now import each of the cleaned files into the projects you created and let SER go to work.
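For those who would rather script steps 3 and 4 than click through Scrapebox, here is a minimal Python sketch of the same idea: merge one site list folder, drop duplicates and unwanted extensions, write a domains-only copy, and split big lists into roughly 600k-URL chunks. The folder path, output names, and extension list are only examples, not anything SER requires:

```python
# Minimal sketch: merge one SER site list folder, clean it, and write the
# cleaned list plus a domains-only version. Paths and names are examples only.
import os
from urllib.parse import urlparse

SITE_LIST_DIR  = r"C:\path\to\your\identified\site list folder"   # adjust per folder
OUTPUT_FILE    = "GSA Identified Site List.txt"
DOMAINS_FILE   = "GSA Identified Site List Domains Only.txt"
CHUNK_SIZE     = 600_000                                          # ~600k URLs per file
BAD_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".pdf")        # extend as needed

def merge_and_clean(folder):
    """Merge every .txt file in the folder, dropping blanks, duplicates and unwanted extensions."""
    urls = set()
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if not name.lower().endswith(".txt"):
                continue
            with open(os.path.join(root, name), encoding="utf-8", errors="ignore") as fh:
                for line in fh:
                    url = line.strip()
                    if url and not urlparse(url).path.lower().endswith(BAD_EXTENSIONS):
                        urls.add(url)
    return sorted(urls)

def to_root_domains(urls):
    """Trim each URL to scheme://host/ and drop duplicate domains."""
    roots = set()
    for url in urls:
        parts = urlparse(url)
        if parts.netloc:
            roots.add(f"{parts.scheme}://{parts.netloc}/")
    return sorted(roots)

def write_chunks(urls, base_name, chunk_size=CHUNK_SIZE):
    """Write the list, splitting into numbered parts if it is larger than chunk_size."""
    for i in range(0, len(urls), chunk_size):
        part = "" if len(urls) <= chunk_size else f"_part{i // chunk_size + 1}"
        with open(base_name.replace(".txt", part + ".txt"), "w", encoding="utf-8") as fh:
            fh.write("\n".join(urls[i:i + chunk_size]) + "\n")

if __name__ == "__main__":
    cleaned = merge_and_clean(SITE_LIST_DIR)
    write_chunks(cleaned, OUTPUT_FILE)
    write_chunks(to_root_domains(cleaned), DOMAINS_FILE)
```

Run it once per site list folder and you end up with the same kind of cleaned files described in steps 3 and 4, ready to import.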
I do this every couple of months when I do my full clean-up of SER, and it has helped immensely.
I hope this is easy to understand and helps you out.
If you have any questions, simply post them, and I will try my best to help.
Comments
Dedup domains already exists as a single function for ALL projects at once.

Maybe a "trim to root" like the one in SB could be added to SER as an additional feature? However, if a domain has its wiki on a subdomain or in a sub/sub-sub folder, then a URL stripped to the root may never find an entry point for a footprint; some sites have NO link from the root to their far inner sections or subdomains.

Maybe a re-identify and re-sort inside SER would help if engines have been modified or added.

Maintenance work done by all users every several months = thousands of man-hours, vs. maintenance work done with 1 or 2 clicks, or even automated every X weeks.
Another option MAY be (??) an "export ALL": Options > Tools > Export Site Lists currently saves in a non-human-readable .sl format. Converting the .sl format to a human-readable format (the data is already stored that way in the actual files) would let you export the 4 categories with a single click, and then let SER run again to import and sort / re-identify everything from scratch. There might be a need to export/import in a neutral format like .txt, so a new feature to do so may be needed.

Forcing SER to identify / sort everything from scratch without any export/re-import, re-sorting where needed, could be another option.

For now, export only seems to exist as .sl, while neutral import exists as .txt.
The site list folders, however, contain plain-text URLs. Someone with knowledge of Microsoft scripting languages could write a simple script to list ALL URLs in our current list folders and sub-folders at once (I currently do this on Linux to join, sort, and filter SB lists before importing them into SER for identification; a rough sketch of such a script is below). 500k URLs take approximately 2-3 seconds (without engine identification, of course); the engine work in SER then shows 65-90+% identified, depending on the list.

A clean-up of lists is certainly needed, but the procedure may have to be more efficient and more automated / time-saving to free us up for other, more important work...
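For anyone who wants to try this on Windows without shell tools, here is a minimal Python sketch of that idea; the folder path is an assumption, so point it at your own site list directories:

```python
# Collect every URL from all site list folders and sub-folders into one
# sorted, de-duplicated file, roughly what the Linux join/sort/filter does.
from pathlib import Path

LIST_ROOT = Path(r"C:\path\to\your\GSA\site lists")   # assumed location
OUTPUT = Path("all_urls_sorted.txt")

urls = set()
for txt_file in LIST_ROOT.rglob("*.txt"):             # walks all sub-folders too
    for line in txt_file.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = line.strip()
        if line:
            urls.add(line)

OUTPUT.write_text("\n".join(sorted(urls)), encoding="utf-8")
print(f"{len(urls)} unique URLs written to {OUTPUT}")
```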
What do you mean by "import each of the files"? After doing the cleanup using ScrapeBox, won't there be only one file left (GSA Identified Site Lists)?
On the next run for link extraction I may save the 404s and test them separately:
- for link extraction in a separate run
- for SER to find footprints

I currently let a shell script do the footprint work on a Linux machine; there the footprint does NOT have to be at the end of the URL but can be anywhere within it. It's highly accurate, and in the last run, about 2-3 hours ago, I had some 800,000 URLs from SB done in maybe 5 seconds. Only the final test is done by SER when importing, either into a project / tier or into the global list. The pre-filtering eliminates approximately 90% of the URLs harvested by SB because they contain no footprint (a rough sketch of this kind of pre-filter follows below).
Since I got SER some 2+ months ago I have had zero free time, and I really like everything to be automated that my computer can do faster and better. YOUR entire procedure could be completely automated and triggered with a single click.

Currently I am thinking about deleting all URLs in the global list and re-importing ALL harvested ones from scratch; I have a few hundred thousand harvested URLs stored for reuse/re-import. Maybe that could be even faster than a clean-up...
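To make the pre-filter idea above concrete, here is a minimal Python sketch under the same assumption: keep only harvested URLs that contain any known URL footprint anywhere in the string. The file names and example footprints are placeholders, not anything SER itself uses:

```python
# Rough sketch of a URL-footprint pre-filter: keep only harvested URLs that
# contain any known footprint anywhere in the URL (not just at the end).

def load_lines(path):
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return [line.strip() for line in fh if line.strip()]

def prefilter(harvested_file, footprint_file, out_file):
    # URL footprints, e.g. "/wiki/index.php", "showthread.php", "/node/add"
    footprints = [fp.lower() for fp in load_lines(footprint_file)]
    kept = 0
    with open(out_file, "w", encoding="utf-8") as out:
        for url in load_lines(harvested_file):
            if any(fp in url.lower() for fp in footprints):
                out.write(url + "\n")
                kept += 1
    print(f"kept {kept} URLs that contain at least one footprint")

if __name__ == "__main__":
    prefilter("harvested.txt", "url_footprints.txt", "prefiltered_for_ser.txt")
```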
versions of the URL" this should be! now SER try find engine only exact URL :-(
When you import the target lists do you import them individually per project or into all projects at once? I did them all at once but that seems to have imported the entire list into each project.