Dead Page Checker

edited September 2013 in Feature Requests
Hi,

In GSA SER I'd really love to have a function that removes all dead pages from the site lists.

Over time the list gets gigantic, and from watching the log and checking with Scrapebox, more than 50% of the pages are dead now, which means 50% of the pages that SER tries to process are for nothing.

Of course I can do it with Scrapebox, but that's a lot of work because you have to load each engine separately, and I think an alive check is a simple function (says the non-programmer ;) ).

Regards

Comments

  • Sven www.GSA-Online.de
    It's on the to-do list but other things have priority right now.
  • yes like platform trainer =)
  • edited September 2013
    ALL that would be needed is to permanently remove (and put into a SER-internal blacklist) all those sites that have more than x failed submission attempts OR that return a 404 or 5xx server error
    or
    remove all PR N/A and PR0 sites from the list
    in my experience LOW PR sites (N/A and 0) are typically on FREE hosting = limited bandwidth (usually 4-6 GB/month), and many/most of such free-hosted sites have exceeded their monthly quota by the MIDDLE of each month = resulting in a 404 or some 5xx server response, depending on the HOST admin configuration
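
The first rule above (more than x failed attempts OR a 404/5xx response) could be sketched roughly as below. This is only an illustration in Python, not how SER works internally: the function names, the `max_failures` default of 3, and the `{url: (failed_attempts, last_status)}` data shape are all assumptions made up for the example.

```python
def should_blacklist(failed_attempts, last_status, max_failures=3):
    """Apply the rule from the post: too many failures OR a 404/5xx response.

    max_failures stands in for the "x" in the post; 3 is an arbitrary default.
    last_status is the last HTTP status seen, or None if none was recorded.
    """
    too_many_failures = failed_attempts > max_failures
    dead_status = last_status is not None and (
        last_status == 404 or 500 <= last_status <= 599
    )
    return too_many_failures or dead_status


def prune_site_list(sites, max_failures=3):
    """Split {url: (failed_attempts, last_status)} into (kept, blacklisted)."""
    kept, blacklisted = [], []
    for url, (failed_attempts, last_status) in sites.items():
        if should_blacklist(failed_attempts, last_status, max_failures):
            blacklisted.append(url)
        else:
            kept.append(url)
    return kept, blacklisted
```

Keeping the blacklisted URLs in their own list, instead of just dropping them, matches the idea of a permanent internal blacklist: the same dead site is not re-imported later.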

    if you INCREASE the minimum PR requirement for your T and projects to 1 or higher,
    you encounter fewer dead sites

    for the current LIVE check
    maybe a feature request to SB could help
    or
    a manual check of all, folder by folder

    the FASTEST UN-scientific method to test whether a site is still there is a single PING to the host (that only tells you the server answers, not that the PAGE exists)
    rather than a regular LIVE check, which asks for the server headers of a page and gets a server response:
    200 = OK, anything else = failed
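
A header-based live check like the one described above (200 = OK, anything else = failed) might look roughly like this in Python. It is a minimal sketch, not Scrapebox's or SER's actual check: `fetch_status`, `is_alive`, `filter_alive` and the 10-second timeout are names and values invented for the example. Note that `urllib` follows redirects, so a redirect that ends on a 200 page counts as alive here.

```python
import http.client
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Send a HEAD request and return the HTTP status, or None if unreachable."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 503.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, malformed URL, and so on.
        return None


def is_alive(url, fetch=fetch_status):
    """200 = OK, anything else (including no answer at all) = failed."""
    return fetch(url) == 200


def filter_alive(urls, fetch=fetch_status):
    """Keep only the URLs that pass the live check, preserving their order."""
    return [url for url in urls if is_alive(url, fetch)]
```

The `fetch` parameter is there so the filter can be tried with a fake lookup before pointing it at a real site list. Some servers reject HEAD requests, so a real tool would probably fall back to GET before calling a page dead.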
  • Good points, hans. Maybe the Scrapebox team can add a function that bulk checks + deletes. I'm gonna write them an email.

    But I think removing all PR0 pages makes absolutely no sense, because every article, wiki etc. is going to have PR0 (because the verified folder saves the actual article URL and NOT the root domain).

    And 50% of all internet pages are PR0, which means you reduce your targets by half...

    Regards
  • edited September 2013
    PR0 on a site usually means either:

    • a new site that still needs to grow and prove its trustworthiness, or
    • an old site that was left to stagnate, often on cheap, slow free hosting and down for a large part of the month because the free quota is exceeded = resulting in either PR N/A or PR0

    in my experience the second option is more frequent, and I usually prefer to start at PR1 for T and PR3 for projects

    a temporary PR limit-reduction I ONLY use if I need fast extra BLs

    there are LOTS of high PR sites out there, just be creative in finding them using SB with country-specific searches, or without search at all using creative SB methods

    of course

    it depends on whether you need hundreds of thousands or millions of links

    or just a few tens of thousands

    free hosted sites come and go

    higher PR sites = ppl have invested efforts + time + $ and are more likely to stay for years or longer

    = resulting in lasting BLs vs volatile BLs

    and

    scrapebox of course has a LIVE check 

    if EVER you have a little learning time left

    do a thousand or more account verifications MANUALLY, including low PR sites (I did some 20k about 3 yrs ago, just for learning / studying the article-site environment on the web)

    you find among such low PR sites numerous errors such as:

    • various mysql db errors
    • NON-configured email notification
    • NO sender in email, NO domain mentioned in email !!
    • NO confirmation URL or code in email
    • 404 or various 5xx errors
    • forms missing
    • form fields missing
    • submission of forms filled NOT working
    • submit buttons NOT working on pages and other fatal mis-configurations
    • captcha required but captcha image missing = captcha error
    • one popular article site platform (forgot which one) by default shows an ERROR on the confirmation landing page you end up on when clicking the confirmation email link, but that ERROR message is an ERROR in itself: if you ignore the "site banned" or similar ERROR message and click "login", ALL works and the account exists
    • or low PR sites have a "pending approval" queue of up to 300'000+ (300k+) (that's the highest number I have seen myself, with several 10k being quite frequent, especially on WP sites)

    A page PR is of little importance for the functionality of a site

    but high PR sites have simply invested more effort to get the site running, and running faster than cheaply hosted sites

    quality hosting has a price

    success in life as well has a price

    and often, raising quality requirements and investing in quality results in MUCH greater success = MORE site visitors = more fun to work for = more rewards in life

  • That was a long posting and I appreciate it, but I will still use PR0 :)

    On lower tiers, 1 link is better than 0 links; quality doesn't matter here. If it gets indexed and then gets lost, who cares? G maybe indexed the link, maybe not, and if it gets removed or the free hosting expires or whatever, G will need MONTHS to notice that and de-index the link.

    Best Regards