I did a few tests with this module, and so I see how the process works now.
Is there any hazard in adding new strings to existing footprints? The only conceivably bad scenario I can imagine would involve adding something that really isn't worthwhile, as in too few occurrences out in the wild, thereby wasting searches that could have been performed with better strings.
What do experienced Footprint Studio GSA users suggest as a reasonably useful sample size?
I guess this is actually a good use of purchased lists that I didn't realize until now...if you're using a Global List that is purchased, that is. Even if the targets are spammed, they can still add to the total used for samples!
Also, SER users: Do you usually work with this module while SER is actually running projects in normal Active mode, or do you stop all Active projects, or decrease the number of threads, or something? Thanks!
Also...has anyone actually replaced, not just added strings to footprints?
Has anyone actually gone as far as to remove any of the boilerplate strings because they found them to, in the end, impede GSA (or whatever program) from scraping at its most effective?
What would the criteria be for that? How would we determine a string isn't one of the best possible choices?
Improving search footprints and engine 'page must have' strings is something any serious SER user should be doing. A lot of the engines have been around for years, and as platforms evolve over time so will the boilerplate text that tends to get used as a footprint. Also, as these footprints have been hammered by thousands (tens/hundreds of thousands) of SER users over the years, site owners (at least the serious ones) have probably tried to remove them all from their sites.
It's the same with the 'page must have' strings that engines use to identify platforms. When I was fixing up the Moodle engine a while back, it was mainly new 'page must have' strings that I updated so that SER would be able to identify Moodle sites.
Both search footprints and 'page must have' strings can be hard to nail down for some platforms. Sometimes they just don't have much identifying code in the HTML, or boilerplate text that can be used to search for them.
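On the question of criteria, one rough way to judge a candidate string is to check how often it actually shows up in a sample of pages you already know belong to the platform. The snippet below is only a sketch of that idea, not how SER or Footprint Studio does it internally. The function name `coverage`, the "ExampleCMS" platform, and the sample pages are all made up for illustration; in practice you would load saved HTML from your own verified list.

```python
def coverage(candidates, pages):
    """Return, for each candidate string, the fraction of pages containing it.

    The comparison is case-insensitive. `pages` is a list of HTML strings
    from sites already known to run the platform (hypothetical sample data).
    """
    if not pages:
        return {c: 0.0 for c in candidates}
    lowered = [p.lower() for p in pages]
    return {
        c: sum(c.lower() in page for page in lowered) / len(lowered)
        for c in candidates
    }


# Made-up sample pages for an imaginary platform called "ExampleCMS".
pages = [
    "<html><body>Powered by ExampleCMS <form id='login'></form></body></html>",
    "<html><body>powered by examplecms</body></html>",
    "<html><body>Some unrelated blog</body></html>",
    "<html><body><form id='login'></form> ExampleCMS</body></html>",
]
candidates = ["Powered by ExampleCMS", "Leave a reply"]

for text, share in coverage(candidates, pages).items():
    print(f"{share:.0%}  {text}")
```

A string that matches only a tiny share of a decent-sized sample is probably a candidate for replacement, though what counts as "tiny" is a judgment call and depends on how many sites the platform has overall.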
If you are targeting non-English sites, you should definitely add footprints. An easy method is to look at the footprints in SER and then check the language files of the respective platform on GitHub. This gives you many more target sites and actual links.
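As a sketch of the language-file approach: many PHP platforms keep their translations in files with lines like `$string['key'] = 'text';`, so you can pull the translated version of whatever boilerplate an existing footprint is based on. The snippet below assumes that file format; the key name and the German sample text are placeholders written for illustration, not copied from a real language pack, and the function name `footprints_from_lang_file` is hypothetical.

```python
import re

# Matches PHP-style entries such as: $string['some_key'] = 'Some text';
ENTRY = re.compile(r"\$string\['(?P<key>[^']+)'\]\s*=\s*'(?P<text>(?:[^'\\]|\\.)*)';")


def footprints_from_lang_file(source, wanted_keys):
    """Return {key: quoted search footprint} for the requested keys."""
    found = {}
    for match in ENTRY.finditer(source):
        if match.group("key") in wanted_keys:
            # Undo PHP's escaped single quotes, then wrap in double quotes
            # so a search engine treats it as an exact phrase.
            text = match.group("text").replace("\\'", "'")
            found[match.group("key")] = f'"{text}"'
    return found


# Illustrative sample only; a real file would come from the platform's repo.
SAMPLE_DE = """
$string['loggedinnot'] = 'Sie sind nicht angemeldet.';
$string['home'] = 'Startseite';
"""

print(footprints_from_lang_file(SAMPLE_DE, {"loggedinnot"}))
```

You would still want to check each translated string against a few real sites before adding it, since short or generic phrases will match far more than the platform you are after.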