A REAL feature that should be in GSA PI

Wazon USA
edited August 2015 in GSA Platform Identifier


After reading this post:

"The reason we have the Min./Max. file size is that sometimes you get pages that are maybe 1 or 2KB, and that probably means it’s a 404 page that says something like “this is a 404 page, click here to go to homepage.”

Obviously that’s no use to us. We want a page that actually has some HTML on it. So that’s why we set the Min. file size to 10KB. You can set it lower if you want, but you’ve been warned!

I set the Max. size to 200KB, which is large because the file size doesn’t include any media that’s on the page. Images and videos are not included. 200KB is the limit for just the HTML, CSS and JavaScript (and any text content)."

GSA PI could filter out all URLs whose pages are smaller than a minimum size (say 1KB, 2KB or 10KB) or larger than a maximum size (200KB, for example). This would eliminate most of the sites with 404 errors!
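
To make the request concrete, here is a minimal Python sketch of the size filter described above. It is only an illustration and not how GSA PI works internally: the function names are hypothetical, the pages are assumed to be already downloaded as raw HTML bytes, and the 10KB/200KB limits are simply the values from the quoted post.

```python
from typing import Dict, List

MIN_BYTES = 10 * 1024    # 10KB minimum, taken from the quoted post
MAX_BYTES = 200 * 1024   # 200KB maximum, taken from the quoted post


def within_size_limits(html: bytes,
                       min_bytes: int = MIN_BYTES,
                       max_bytes: int = MAX_BYTES) -> bool:
    """Return True if the raw HTML size lies inside the allowed range."""
    return min_bytes <= len(html) <= max_bytes


def filter_by_size(pages: Dict[str, bytes],
                   min_bytes: int = MIN_BYTES,
                   max_bytes: int = MAX_BYTES) -> List[str]:
    """Keep only the URLs whose already-downloaded HTML passes the size check."""
    return [url for url, html in pages.items()
            if within_size_limits(html, min_bytes, max_bytes)]
```

With these defaults, a 1.5KB page (likely an error page) and a 300KB page would both be dropped, while a 50KB page would be kept.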


  • @OP
    The post in your link is about building footprints with the footprint factory tool (anyway, this tool is far from perfect); it's not usable in any way for PI or SER.
    GSA PI (probably) won't identify error pages. To check how big a page is, GSA PI needs to download it anyway, so bandwidth is used. Why not try to identify it when 50% of the job is already done?

    If you don't identify sites larger than 200KB, you will skip all blog comment websites and probably many others. That's why in SER we download websites up to 5-10MB, and the same goes for PI. Limiting the size of downloaded websites can cut your identified/verified list in half, which you probably don't want.

  • s4nt0s Houston, Texas
    404s shouldn't really pass any engine filters and are pretty small in size. We will look into this, but I'm not sure it will help much.
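
The point made in the replies can be sketched roughly as follows: the page has to be downloaded before its size is known, so it may as well be run through identification instead of being dropped by size. This is an assumption-laden Python illustration, not GSA PI or SER code. `classify_page` and the `matches_engine` callback are hypothetical names, and the 10MB cap is just the upper end of the 5-10MB range mentioned above.

```python
from typing import Callable

# Upper end of the 5-10MB download limit mentioned in the thread (assumed value).
MAX_DOWNLOAD_BYTES = 10 * 1024 * 1024


def classify_page(status: int, html: bytes,
                  matches_engine: Callable[[bytes], bool]) -> str:
    """Classify an already-downloaded page as 'error', 'identified' or 'unknown'."""
    if status >= 400:
        # A real HTTP error (404 and similar) is rejected by status code alone.
        return "error"
    # Look at no more than the download cap; page size is not used as a filter.
    body = html[:MAX_DOWNLOAD_BYTES]
    if matches_engine(body):
        return "identified"
    # A small "soft 404" page served with status 200 matches no engine and ends up here.
    return "unknown"
```

With a toy matcher such as `lambda html: b"leave a comment" in html.lower()`, a 300KB blog page with a comment form is still identified, while a tiny "this is a 404 page" response falls through as unknown without any size limit being involved.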
