Optimizations

edited September 2012 in Feature Requests
I've seen the option to use proxies for PR checking that was added because some users were getting their IP banned for checking PR too often, and I think it is a good option to have. But this has led me to think about the way proxies are managed, and this is what I've come up with:
Usually SER uses many search engines to scrape targets, so it is not easy to have your IP banned for searching even if you don't use proxies. The same goes for posting: since you are not posting to the same site many times you shouldn't have any problems, although you should still use proxies to maintain anonymity.
For PR requests, on the other hand, I think you are always checking against the same servers (Google's?), and this will get you temporarily banned. So being able to use proxies for this is great.
But what happens if you don't use proxies, or use public or private ones that might get banned too?
Basically, when SER finds a target it tries to do a series of things like downloading HTML, checking PR, submitting to forms and so on...
All these operations are done using proxies (if set up correctly).
Many have reported seeing long streaks of "download failed" or "PR ? too low" messages in the logs, and if I remember correctly Sven said that most of these could be due to bad proxies.
In this situation SER basically marks these sites as "already parsed" and they are not considered in future searches.
I think there should be an option at the project level that, if enabled, will not put these targets on the "already parsed" list.
I mean, if SER didn't get a response from PR checking, it doesn't know the PR, so it shouldn't post to that site; but the PR also shouldn't be considered too low. Maybe a bad proxy was the issue, and trying that target on a future search (maybe tomorrow) might yield a good target. Likewise, a "download failed" message might mean the proxy is dead or too slow, and with fresh proxies you might be able to post to that site.
So on projects where I don't expect to find many targets, I might want the option to mark as "already parsed" (bad sites) only the ones I am really sure are bad, and not the ones that simply haven't responded.
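The distinction argued for above could be sketched roughly like this. SER's internals are not public, so the outcome names and the option flag here are hypothetical, purely to illustrate the proposed behavior:

```python
# Hypothetical outcome labels; SER does not expose these names.
DEFINITIVE = {"submitted", "pr_too_low_confirmed", "engine_not_matched"}
AMBIGUOUS = {"download_failed", "pr_unknown", "proxy_timeout"}

def should_mark_parsed(outcome, keep_ambiguous_targets=True):
    """Return True if a target should go on the 'already parsed' list.

    Only definitive results (the site really answered and was accepted
    or rejected) mark the target as parsed. Ambiguous results, which
    may just be a dead or banned proxy, leave the target eligible for
    a future search when keep_ambiguous_targets is on.
    """
    if outcome in DEFINITIVE:
        return True
    if outcome in AMBIGUOUS and keep_ambiguous_targets:
        return False  # retry later with fresh proxies
    return True  # current behavior: everything gets marked
```

With the flag off, the sketch degrades to today's behavior, where a "download failed" permanently buries the target.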

Another question is about the proxy testing that SER does. If I've understood it correctly, it only checks whether a proxy is alive by downloading a specific page and searching for a string. So if a proxy is alive but banned for certain operations (like PR checking by Google), this wouldn't be detected: the proxy will be used, you will get PR ?, and your target will be marked as a bad target... is this correct?
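The gap described above can be illustrated with a minimal sketch. The fetch function is injected so no real network call happens here; the marker strings are placeholders, not anything SER or Google actually uses:

```python
def proxy_is_alive(fetch, test_url, marker):
    """The kind of liveness check described above: download one known
    page through the proxy and look for an expected string."""
    try:
        body = fetch(test_url)
    except OSError:
        return False
    return marker in body

def proxy_usable_for(fetch, service_url, ban_marker):
    """A per-service probe: a proxy can pass the liveness check yet be
    banned by one specific service, which only a request to that
    service would reveal."""
    try:
        body = fetch(service_url)
    except OSError:
        return False
    return ban_marker not in body
```

A liveness check alone cannot distinguish "alive" from "alive but banned by Google", which is exactly why the banned proxy keeps getting used and keeps producing PR ?.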

Thank you.

Comments

  • Saving submitted but failed sites was something I asked for a few months ago, but Sven was not keen on the idea, unfortunately.

    Just saving submitted sites on a per-project basis (i.e. not the global site list, and not the current submitted list that is decremented when a target is verified or timed out) would be helpful.

    Also, retrying failures x amount of times is something I would very much like to see, and have mentioned before.
    I would imagine it working by returning the URL to the target cache, with an additional integer parameter to count the retries, e.g.:

    targeturl.com/register#1

    This would be especially helpful to those with a limited target pool (e.g. using restrictive filters like high PR) and/or those using OCR for captchas (e.g. Captcha Sniper).
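The `#1` counter idea from that comment could be parsed and incremented like this. This is only a sketch of the suggested URL-fragment scheme; `max_retries` is an assumed cap, not an existing SER setting:

```python
from urllib.parse import urldefrag

def next_retry(target, max_retries=3):
    """Bump the retry counter carried in the URL fragment, as in
    targeturl.com/register#1. Returns None once the budget is spent,
    meaning the target should finally be dropped."""
    url, frag = urldefrag(target)
    count = int(frag) if frag.isdigit() else 0
    if count >= max_retries:
        return None  # retry budget exhausted
    return f"{url}#{count + 1}"
```

A fragment works for this because it is never sent to the server, so the counter piggybacks on the stored target without changing the request.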
  • Yes, sometimes we struggle to get new targets and may have missed important ones... other times we have plenty of targets and don't want to lose time repeating the same ones that have failed...
  • +1 for m3ownz's statements. I would really like retry possibilities/counts (distributed over time) too. But only if the same domain wasn't posted to before (this is very important so we don't spam too much).
  • GiorgosK Greece
    edited September 2012
    m3ownz
    saving submitted sites per project was a feature request implemented by Sven a few days ago
    https://forum.gsa-online.de/discussion/287/add-checkbox-to-not-delete-submitted-urls-from-directory-lists

    It's in the project options: "no remove"
  • Thanks GiorgosK, I will look into that.
    I had skimmed over it as I don't use directories, but it could be useful, thanks.
  • This isn't what he's talking about, I don't think; that won't solve the PR ? retry, because if it gets a PR ? and you have it set to PR 1 or higher, it will just ignore the target and not try it again. The option you are talking about just doesn't delete ones that were already submitted but failed verification. They are talking about retrying failed ones...