How to keep GSA SER Lists Clean And Increase Your LPM

Trevor_Bandura 267,647 NEW GSA SER Verified List
edited October 2013 in GSA Search Engine Ranker
I'm an avid user of GSA SER and totally love the software that @Sven has made. As you know, as your lists build up, so does all the garbage in them. With this post, I hope to show you how to keep your lists clean as a whistle.

If you’re having low LPM problems, this may fix everything for you.

First off, I have to stress this: it takes a lot of work and time to complete. So if you are one of those users who doesn't like a lot of work, or doesn't mind getting all those download-failed messages in SER, you can stop reading now. Thank you.

1) The first thing you want to do is create a project for each engine type that SER has. (I told you this was going to be a lot of work.)

The reason for this is that SER sometimes misidentifies an engine when you have multiple engines selected that share the same footprints on a page. A great example is the General Blogs engine and a few Article engines: I've had many targets identified as General Blogs that should have been identified as, for example, Zendesk.

Make sure that you do not put any limits on these projects, such as pausing after X verified links.

2) Once you have all the projects created, select them all, then Right Click > Modify Project > Move To Group > New. Name this group "Sorting Project". Now all those projects will be in one tidy group for you.

3) Once that's done, open up Scrapebox and its DupRemove addon. What you want to do now is merge all the site lists together: merge everything in your Identified folder and save it as "GSA Identified Site Lists", then do the same for each of your other site-list folders. With that same tool, remove all the duplicate URLs as well.
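If you'd rather script the merge-and-dedup step than use Scrapebox, here is a minimal Python sketch of the same idea. The folder paths and output names are placeholders, not SER's actual defaults; point the glob at wherever your site-list .txt files live.

```python
import glob

def merge_and_dedupe(folder_glob, out_path):
    """Merge every site-list file matching folder_glob into one file, dropping duplicate URLs."""
    seen, merged = set(), []
    for path in sorted(glob.glob(folder_glob)):
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                url = line.strip()
                if url and url not in seen:  # keep first occurrence only
                    seen.add(url)
                    merged.append(url)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged))
    return len(merged)
```

Run it once per folder (Identified, Submitted, Verified, Failed) to get one clean master file per folder.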

If you have huge site lists, with over 1 million URLs in any of the files you just created, simply split them into chunks of about 600,000 URLs.
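The splitting can be scripted too. A sketch, assuming one URL per line (the ".partN.txt" naming is just an example):

```python
def split_list(path, chunk_size=600_000):
    """Split one large URL file into numbered files of at most chunk_size URLs each."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        urls = [line.strip() for line in f if line.strip()]
    n_chunks = 0
    for i in range(0, len(urls), chunk_size):
        n_chunks += 1
        # Write each slice of chunk_size URLs to its own numbered file.
        with open(f"{path}.part{n_chunks}.txt", "w", encoding="utf-8") as out:
            out.write("\n".join(urls[i:i + chunk_size]))
    return n_chunks
```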

4) Once you have your files created, import each one into Scrapebox and remove all unwanted extensions, e.g. .gif, .pdf, .jpg, etc. Then export the cleaned URL list under the same file name you just cleaned, and Scrapebox will overwrite it with the cleaned list.

Taking this one step further: trim each URL to its root domain, remove duplicate domains, and save that under the same file name plus "Domains Only". This helps because if the URL to, say, an article no longer exists, you'll get a 404 error; with the domain only, SER should still be able to identify the site.
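Both of these cleaning steps (extension filtering and the domains-only list) can be sketched with the Python standard library alone. The BAD_EXTENSIONS tuple is just an example set, not an exhaustive list:

```python
from urllib.parse import urlparse

# Example set only -- extend with whatever file types you want filtered out.
BAD_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".pdf")

def clean_urls(urls):
    """Drop image/document URLs, then build a deduplicated domains-only companion list."""
    cleaned, domains, seen = [], [], set()
    for url in urls:
        parts = urlparse(url)
        if parts.path.lower().endswith(BAD_EXTENSIONS):
            continue  # unwanted extension: skip the URL entirely
        cleaned.append(url)
        root = f"{parts.scheme}://{parts.netloc}/"
        if root not in seen:  # this is the "remove duplicate domains" step
            seen.add(root)
            domains.append(root)
    return cleaned, domains
```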

5) Once you have all your new clean lists created, DELETE all of GSA SER's site-list files. This is very important: if you don't, it defeats the purpose of doing all this.

6) Now import each of the files into the matching projects you created and let SER go to work.

I do this every couple of months when I do my full clean up of SER and this has helped immensely.

I hope this is easy to understand and helps you out. 

If you have any questions, simply post them, and I will try my best to help.

Answers

  • Sounds good, but this could actually be a future feature request: a clean-up function inside SER.
    Dedup domain already exists as a single function for ALL projects at once; maybe a "trim to root" like the one in SB could be added to SER as an additional feature?
    However, if a domain hosts its wiki on a subdomain or in a sub/sub-sub folder, a URL stripped to root may never give SER an entry point for a footprint: some sites have NO link from the root to far inner sections or subdomains.

    Maybe also a re-identify and re-sort inside SER, for whenever engines have been modified or added.

    Maintenance work done by every user every several months = thousands of man-hours,
    versus maintenance done in 1 or 2 clicks,
    or even automated every x weeks.

    Another option MAY be to export ALL via Options > Tools > Export Site Lists, then export the 4 categories with a single click, and then let SER run again to import and sort/re-identify from scratch. (The export is currently in the non-human-readable .sl format; it would need converting to a human-readable format, since the data is already stored that way in the actual files.)

    But there might be a need to export/import in a neutral format like .txt, so a new feature for that may be needed. Forcing SER to identify/sort everything from scratch, without the export/re-import, could be another option, re-sorting where needed. For now, export only seems to exist as .sl, while neutral import exists as .txt.

    The list folders, however, contain plain-text URLs. Someone with knowledge of Microsoft scripting languages could write a simple script to list ALL URLs in the current list folders and sub-folders at once. (I currently do this on Linux to join, sort, and filter SB lists before SER imports them for identification: about 500k URLs in 2-3 seconds, without the engine work of course; the engine step then shows 65-90+% identified, depending on the list.)

    A clean-up of lists is certainly needed, but the procedure may have to be more efficient and more automated/time-saving, to free us up for other, more important work...
  • Trevor_Bandura
    When Scrapebox trims to root, it trims to either the subdomain or the domain, depending on the format of the URL.

    To take this one step further, like you mentioned with the wikis, SB also has a feature to trim to the last folder. This could be used as well.

    So you would end up creating 3 different versions of the URL lists.

    1) Full URL

    May contain 404 pages if articles, bookmarks, etc. have been deleted. These pages may or may not still have the footprints needed for identification.

    2) Domain/Subdomain

    Footprints should be there for easy identification through SER.

    3) Trim to last folder.

    This, I think, would get rid of any 404 errors from articles being deleted from the target sites, and these URLs should also be easy for SER to identify the platform from.

    Having this automated would be great. For example, if SER finds an already-identified site, or a site that was posted to before, but it now returns a 404, SER could automatically trim to the last folder and re-identify it. If the re-identify fails, simply delete it from the site lists. This would keep everything nice and clean.

    But to tell you the truth, I really don't mind doing all this manually every month or so.
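    (The three versions could be generated with a few lines of Python. The trim-to-last-folder rule below is one plausible interpretation of what SB does, not its documented behavior:)

```python
from urllib.parse import urlparse

def url_variants(url):
    """Return the three versions discussed above: full URL, domain root, last-folder URL."""
    parts = urlparse(url)
    root = f"{parts.scheme}://{parts.netloc}/"
    # Trim to last folder: drop the final path segment (usually the page itself).
    # If the URL has no folder beyond the domain, fall back to the root.
    folder = url.rsplit("/", 1)[0] + "/" if parts.path.count("/") > 1 else root
    return url, root, folder
```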
  • edited October 2013
    I don't get this part: "6) Now import each of the files in to each of the projects you created and let SER go to work."

    What do you mean by "import each of the files"? After doing the cleanup using ScrapeBox, won't there be only one file left (GSA Identified Site Lists)?
  • Trevor_Bandura
    edited October 2013
    1) You want to do this clean-up process for all your site-list folders. SER has 4 different folders where it saves site lists:

    Failed
    Submitted
    Verified
    Identified

    I don't use all the options for saving site-list URLs, but some members save everything.

    2) Right click on project > Import URLs > From file.
  • Testing that with one folder down may be worth it. On my next link-extraction run I may save the 404s and test them separately:

    - for link extraction on a separate run
    - for SER to find footprints

    I currently let a shell script do the footprint work on a Linux machine. There, the footprint does NOT have to be at the end of a URL; it can be anywhere within it. It's highly accurate, and on my last run, about 2-3 hours ago, some 800,000 URLs from SB were done in maybe 5 seconds. Only the final test is done by SER when importing, either into a project/tier or into the global list.

    The prefiltering eliminates approximately 90% of the URLs harvested by SB, the ones that contain no footprint.

    Since I got SER some 2+ months ago, I have zero free time, and I really like everything to be automated that my computer can do faster and better than me.

    YOUR entire procedure could be completely automated and triggered with a single click.

    Currently I'm thinking about deleting all URLs in the global list and re-importing ALL my harvested ones from scratch; I have a few hundred thousand harvested URLs stored for reuse/re-import. That might even be faster than a clean-up...
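    (A sketch of the URL-level prefilter described above. The footprint strings passed in are hypothetical examples, not SER's actual engine footprints:)

```python
def prefilter(urls, url_footprints):
    """Keep only URLs whose address contains a known footprint anywhere in the string."""
    footprints = [f.lower() for f in url_footprints]
    # Case-insensitive substring match, anywhere in the URL (not just at the end).
    return [u for u in urls if any(f in u.lower() for f in footprints)]
```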
  • > 1) First thing you want to do is create a project for each engine type that SER has. (I told you this is going to be lot's of work).

    Do you mean for each group?
    Like one project for Blog Comments, another for Directory, etc.?
    Or a new project for each and every engine?
  • Does it make sense to do this if I had GSA running for 1 week?
  • Trevor_Bandura
    @miki Yes I mean Group. Sorry.

    @DonAntonio No, I would wait until you have been running it for a few months. The exception is if, when running from site lists only, you're getting lots of "No Engine Matches" messages in the log for sites you have posted to before and know should be good. Those messages most likely mean that the article (or whatever page SER saved as a verified link) has been deleted.
  • Great, thanks for the clarification.
    Do you build your lists through SER's scraper only, or do you use Scrapebox/GScraper too?
  • Trevor_Bandura
    Through SER and, as of late, mostly Scrapebox. Proxies get banned too fast for me in SER. But I have found a way to get thousands of new targets daily.
  • edited October 2013
    Hi Trevor_Bandura,

    How do you get thousands of new targets daily?
  • @sven - what @Trevor_Bandura and @hans51 have mentioned would be awesome if it could be implemented... any idea if it could be added?

    I've been doing it manually all day: cleaning up, removing duplicates, basically organising all of the files and scraping new ones with Scrapebox. I've had to stop all projects to do so, because it was slowing my VPS down, which isn't ideal.
  • "3 different versions of the URL" is how it should be! Right now SER only tries to find the engine on the exact URL :-(
  • @Trevor_Bandura thanks for this post.

    When you import the target lists, do you import them individually per project or into all projects at once? I did them all at once, but that seems to have imported the entire list into each project.
  • If you add them via right-click, it adds the list once per project.
  • I have a list of 3 million links that I bought. How am I supposed to separate them by engine?