How to keep GSA SER Lists Clean And Increase Your LPM

Trevor_Bandura 267,647 NEW GSA SER Verified List
edited October 2013 in GSA Search Engine Ranker
I'm an avid user of GSA SER and totally love the software that @Sven has made. As you know, as your lists build up, so does all the garbage in them. With this post, I hope to show you how to keep your lists clean as a whistle.

If you’re having low LPM problems, this may fix everything for you.

First off, I have to stress this: it takes a lot of work and time to complete. So if you are one of those users who doesn't like a lot of work, or doesn't mind getting all those download-failed messages in SER, you can stop reading now. Thank you.

1) The first thing you want to do is create a project for each engine type that SER has. (I told you this was going to be a lot of work.)

The reason for this is that SER sometimes misidentifies an engine when you have multiple engines selected that share the same footprints on a page. A great example is the General Blogs engine and a few Article engines: I've had many targets identified as General Blogs that should have been identified as, for example, Zendesk.

Make sure that you do not put any limits on these projects, such as pausing after X verified links.

2) Once you have all the projects created, select them all, then Right Click > Modify Project > Move To Group > New. Name this group "Sorting Project". Now all those projects will be in one tidy group for you.

3) Once that's done, open up Scrapebox and its DupRemove addon. What you want to do now is merge all the site lists together: merge everything in your Identified folder and save it as "GSA Identified Site Lists", then do the same for each of your other site-list folders. With that same tool, remove all the duplicate URLs as well.
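If you'd rather script the merge-and-dedup step than use Scrapebox, here is a minimal Python sketch of the same idea. The folder paths and output names are placeholders, not SER's actual defaults; point the glob at wherever your site-list .txt files live.

```python
import glob

def merge_and_dedupe(folder_glob, out_path):
    """Merge every site-list file matching folder_glob into one file, dropping duplicate URLs."""
    seen, merged = set(), []
    for path in sorted(glob.glob(folder_glob)):
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                url = line.strip()
                if url and url not in seen:  # keep first occurrence only
                    seen.add(url)
                    merged.append(url)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged))
    return len(merged)
```

Run it once per folder (Identified, Submitted, Verified, Failed) to get one clean master file per folder.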

If you have huge site lists, with over 1 million URLs in any of the files you just created, simply split them into chunks of about 600,000 URLs.
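The splitting can be scripted too. A sketch, assuming one URL per line (the ".partN.txt" naming is just an example):

```python
def split_list(path, chunk_size=600_000):
    """Split one large URL file into numbered files of at most chunk_size URLs each."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        urls = [line.strip() for line in f if line.strip()]
    n_chunks = 0
    for i in range(0, len(urls), chunk_size):
        n_chunks += 1
        # Write each slice of chunk_size URLs to its own numbered file.
        with open(f"{path}.part{n_chunks}.txt", "w", encoding="utf-8") as out:
            out.write("\n".join(urls[i:i + chunk_size]))
    return n_chunks
```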

4) Once you have your files created, import each one into Scrapebox and remove all unwanted extensions, e.g. .gif, .pdf, .jpg, etc. Then export the cleaned URL list under the same file name you just cleaned, and Scrapebox will overwrite it with the cleaned list.

Taking this one step further: trim each URL to its root domain, remove duplicate domains, and save that under the same file name plus "Domains Only". This helps because if the URL to, say, an article no longer exists, you'll get a 404 error; with the domain only, SER should still be able to identify the site.
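Both of these cleaning steps (extension filtering and the domains-only list) can be sketched with the Python standard library alone. The BAD_EXTENSIONS tuple is just an example set, not an exhaustive list:

```python
from urllib.parse import urlparse

# Example set only -- extend with whatever file types you want filtered out.
BAD_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".pdf")

def clean_urls(urls):
    """Drop image/document URLs, then build a deduplicated domains-only companion list."""
    cleaned, domains, seen = [], [], set()
    for url in urls:
        parts = urlparse(url)
        if parts.path.lower().endswith(BAD_EXTENSIONS):
            continue  # unwanted extension: skip the URL entirely
        cleaned.append(url)
        root = f"{parts.scheme}://{parts.netloc}/"
        if root not in seen:  # this is the "remove duplicate domains" step
            seen.add(root)
            domains.append(root)
    return cleaned, domains
```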

5) Once you have all your new clean lists created, DELETE all of GSA SER's site-list files. This is very important: if you don't, it defeats the purpose of doing all this.

6) Now import each of the files into the matching projects you created and let SER go to work.

I do this every couple of months when I do my full clean up of SER and this has helped immensely.

I hope this is easy to understand and helps you out. 

If you have any questions, simply post them, and I will try my best to help.

Answers

  • Sounds good, but this could actually be a future feature request: a clean-up function inside SER.
    Dedup domain already exists as a single function for ALL projects at once; maybe a "trim to root" like the one in SB could be added to SER as an additional feature?
    However, if a domain hosts its wiki on a subdomain or in a sub/sub-sub folder, a URL stripped to root may never give SER an entry point for a footprint: some sites have NO link from the root to far inner sections or subdomains.

    Maybe also a re-identify and re-sort inside SER, for whenever engines have been modified or added.

    Maintenance work done by every user every several months = thousands of man-hours,
    versus maintenance done in 1 or 2 clicks,
    or even automated every x weeks.

    Another option MAY be to export ALL via Options > Tools > Export Site Lists, then export the 4 categories with a single click, and then let SER run again to import and sort/re-identify from scratch. (The export is currently in the non-human-readable .sl format; it would need converting to a human-readable format, since the data is already stored that way in the actual files.)

    But there might be a need to export/import in a neutral format like .txt, so a new feature for that may be needed. Forcing SER to identify/sort everything from scratch, without the export/re-import, could be another option, re-sorting where needed. For now, export only seems to exist as .sl, while neutral import exists as .txt.

    The list folders, however, contain plain-text URLs. Someone with knowledge of Microsoft scripting languages could write a simple script to list ALL URLs in the current list folders and sub-folders at once. (I currently do this on Linux to join, sort, and filter SB lists before SER imports them for identification: about 500k URLs in 2-3 seconds, without the engine work of course; the engine step then shows 65-90+% identified, depending on the list.)

    A clean-up of lists is certainly needed, but the procedure may have to be more efficient and more automated/time-saving, to free us up for other, more important work...
  • Trevor_Bandura
    When Scrapebox trims to root, it trims to either the subdomain or the domain, depending on the format of the URL.

    To take this one step further, like you mentioned with the wikis, SB also has a feature to trim to the last folder. This could be used as well.

    So you would end up creating 3 different versions of the URL lists.

    1) Full URL

    May contain 404 pages if articles, bookmarks, etc. have been deleted. These pages may or may not still have the footprints needed for identification.

    2) Domain/Subdomain

    Footprints should be there for easy identification through SER.

    3) Trim to last folder.

    This, I think, would get rid of any 404 errors from articles being deleted from the target sites, and these URLs should also be easy for SER to identify the platform from.

    Having this automated would be great. For example, if SER finds an already-identified site, or a site that was posted to before, but it now returns a 404, SER could automatically trim to the last folder and re-identify it. If the re-identify fails, simply delete it from the site lists. This would keep everything nice and clean.

    But to tell you the truth, I really don't mind doing all this manually every month or so.
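    (The three versions could be generated with a few lines of Python. The trim-to-last-folder rule below is one plausible interpretation of what SB does, not its documented behavior:)

```python
from urllib.parse import urlparse

def url_variants(url):
    """Return the three versions discussed above: full URL, domain root, last-folder URL."""
    parts = urlparse(url)
    root = f"{parts.scheme}://{parts.netloc}/"
    # Trim to last folder: drop the final path segment (usually the page itself).
    # If the URL has no folder beyond the domain, fall back to the root.
    folder = url.rsplit("/", 1)[0] + "/" if parts.path.count("/") > 1 else root
    return url, root, folder
```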
  • edited October 2013
    I don't get this part: "6) Now import each of the files in to each of the projects you created and let SER go to work."

    What do you mean by "import each of the files"? After doing the cleanup using ScrapeBox, won't there be only one file left (GSA Identified Site Lists)?
  • Trevor_Bandura
    edited October 2013
    1) You want to do this clean-up process for all your site-list folders. SER has 4 different folders where it saves site lists:

    Failed
    Submitted
    Verified
    Identified

    I don't use all the options for saving site-list URLs, but some members save everything.

    2) Right click on project > Import URLs > From file.
  • Testing that with one folder down may be worth it. On my next link-extraction run I may save the 404s and test them separately:

    - for link extraction on a separate run
    - for SER to find footprints

    I currently let a shell script do the footprint work on a Linux machine. There, the footprint does NOT have to be at the end of a URL; it can be anywhere within it. It's highly accurate, and on my last run, about 2-3 hours ago, some 800,000 URLs from SB were done in maybe 5 seconds. Only the final test is done by SER when importing, either into a project/tier or into the global list.

    The prefiltering eliminates approximately 90% of the URLs harvested by SB, the ones that contain no footprint.

    Since I got SER some 2+ months ago, I have zero free time, and I really like everything to be automated that my computer can do faster and better than me.

    YOUR entire procedure could be completely automated and triggered with a single click.

    Currently I'm thinking about deleting all URLs in the global list and re-importing ALL my harvested ones from scratch; I have a few hundred thousand harvested URLs stored for reuse/re-import. That might even be faster than a clean-up...
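    (A sketch of the URL-level prefilter described above. The footprint strings passed in are hypothetical examples, not SER's actual engine footprints:)

```python
def prefilter(urls, url_footprints):
    """Keep only URLs whose address contains a known footprint anywhere in the string."""
    footprints = [f.lower() for f in url_footprints]
    # Case-insensitive substring match, anywhere in the URL (not just at the end).
    return [u for u in urls if any(f in u.lower() for f in footprints)]
```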
  • > 1) First thing you want to do is create a project for each engine type that SER has. (I told you this is going to be lot's of work).

    Do you mean for each group?
    Like one project for Blog Comments, another for Directory, etc.?
    Or a new project for each and every engine?
  • Does it make sense to do this if I had GSA running for 1 week?
  • Trevor_Bandura
    @miki Yes I mean Group. Sorry.

    @DonAntonio No, I would wait until you have been running it for a few months. The exception is if, when running from site lists only, you're getting lots of "No Engine Matches" messages in the log for sites you have posted to before and know should be good. Those messages most likely mean that the article (or whatever page SER saved as a verified link) has been deleted.
  • Great, thanks for the clarification.
    Do you build your lists through SER's scraper only, or do you use Scrapebox/GScraper too?
  • Trevor_Bandura
    Through SER and, as of late, mostly Scrapebox. Proxies get banned too fast for me in SER. But I have found a way to get thousands of new targets daily.
  • edited October 2013
    Hi Trevor_Bandura,

    How do you get thousands of new targets daily?
  • @sven - what @Trevor_Bandura and @hans51 have mentioned would be awesome if it could be implemented... any idea if it could be added?

    I've been doing it manually all day: cleaning up, removing duplicates, basically organising all of the files and scraping new ones with Scrapebox. I've had to stop all projects to do so, because it was slowing my VPS down, which isn't ideal.
  • "3 different versions of the URL" is how it should be! Right now SER only tries to find the engine on the exact URL :-(
  • @Trevor_Bandura thanks for this post.

    When you import the target lists, do you import them individually per project or into all projects at once? I did them all at once, but that seems to have imported the entire list into each project.
  • If you add them via right-click, it adds the list once per project.
  • I have a list of 3 million links that I bought. How am I supposed to separate them by engine?