Optimizations
I've seen the option to use proxies for PR checking that was added because some people were getting their IP banned for checking PR too often, and I think it is a good option to have. But it has led me to think about the way proxies are managed, and this is what I've come up with:
Usually SER uses many search engines to scrape targets, so it is not easy to get your IP banned for searching, even if you don't use proxies. The same goes for posting: since you are not posting to the same site many times, you shouldn't have any problems, although you should still use proxies to maintain anonymity.
For PR requests, on the other hand, I think you are always checking against the same servers (Google's?), and this will get you temporarily banned, so being able to use proxies for this is a great option.
But what happens if you don't use proxies, or you use public or private ones that might get banned too?
Basically, when SER finds a target it tries to do a series of things like downloading the HTML, checking PR, submitting to forms and so on.
All these operations are done through proxies (if they are set up correctly).
Many have reported seeing long streaks of "download failed" or "PR ? too low" messages in the logs, and if I remember correctly Sven said that most of these could be due to bad proxies.
In this situation SER basically marks these sites as "already parsed" and they are not considered in future searches.
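To make the idea concrete, here is a rough Python sketch of how I picture the current behaviour. The names (download, check_pagerank, submit_to_form, MIN_PR) are invented by me for illustration; this is obviously not SER's real code, just the "any failure ends up in already parsed" pattern described above:

```python
MIN_PR = 2  # assumed project setting, purely illustrative

def download(url, proxy):              # stub: would fetch the HTML through the proxy
    ...

def check_pagerank(url, proxy):        # stub: would query the PR servers through the proxy
    ...

def submit_to_form(url, html, proxy):  # stub: would do the actual submission
    ...

def process_target(url, proxy, already_parsed):
    html = download(url, proxy)        # fails if the proxy is dead or too slow
    if html is None:
        already_parsed.add(url)        # "download failed" -> never revisited
        return
    pr = check_pagerank(url, proxy)    # returns None if the proxy is banned for PR checks
    if pr is None or pr < MIN_PR:
        already_parsed.add(url)        # "PR ? too low" -> never revisited
        return
    submit_to_form(url, html, proxy)
```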
I think there should be an option at the project level that, if enabled, will not put these targets in the "already parsed" list.
I mean, if SER didn't get a response from the PR check, it doesn't know the PR, so it shouldn't post to that site, but the site also shouldn't be treated as too low. Maybe a bad proxy was the cause, and trying that target in a future search (maybe tomorrow) might turn up a good target. Likewise, a "download failed" message might just mean the proxy is dead or too slow, and with fresh proxies you might be able to post to that site.
So on projects where I don't expect to find that many targets, I'd want the option to mark as "already parsed" (bad sites) only the ones I am really sure about, and not the ones that simply haven't responded (something like the sketch below).
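Here is the same sketch changed to show the option I have in mind, reusing the hypothetical helpers (download, check_pagerank, submit_to_form, MIN_PR) from the sketch above. Again, the names and the keep_unsure flag are my own invention, not an existing SER setting: only confirmed-bad targets go to "already parsed", while proxy-looking failures get requeued for a later run.

```python
def process_target(url, proxy, already_parsed, retry_later, keep_unsure=True):
    html = download(url, proxy)
    if html is None:
        # could be a dead/slow proxy rather than a dead site
        (retry_later if keep_unsure else already_parsed).add(url)
        return
    pr = check_pagerank(url, proxy)
    if pr is None:
        # PR unknown: don't post now, but don't call it "too low" either
        (retry_later if keep_unsure else already_parsed).add(url)
        return
    if pr < MIN_PR:
        already_parsed.add(url)   # a real answer: this site really is too low
        return
    submit_to_form(url, html, proxy)
```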
Another question is about the proxy testing that SER does. If I've understood it correctly, it basically only checks whether a proxy is alive by downloading a specific page and searching for a string. So if a proxy is alive but banned for certain operations (like PR checking by Google), this wouldn't be detected, the proxy would still be used, you would get "PR ?", and your target would be marked as a bad target... is this correct?
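What I mean is that "alive" and "not banned by Google" are two separate questions. A rough sketch of the difference, assuming the Python "requests" library and placeholder URLs (this is not how SER actually tests proxies, just the idea):

```python
import requests

def proxy_is_alive(proxy):
    """Aliveness-style test: download a known page and look for a known string."""
    try:
        r = requests.get("http://example.com/",
                         proxies={"http": proxy, "https": proxy}, timeout=10)
        return "Example Domain" in r.text
    except requests.RequestException:
        return False

def proxy_usable_for_google(proxy):
    """Extra test: hit a Google page and make sure we are not blocked or captcha'd."""
    try:
        r = requests.get("https://www.google.com/search?q=test",
                         proxies={"http": proxy, "https": proxy}, timeout=10)
        # a non-200 status or a captcha page means the proxy is alive but banned
        return r.status_code == 200 and "captcha" not in r.text.lower()
    except requests.RequestException:
        return False
```

A proxy can pass the first check and fail the second, which would explain streaks of "PR ?" even when the proxy list looks healthy.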
Thank you.
Comments
Saving submitted sites per project was a feature request implemented by Sven a few days ago:
https://forum.gsa-online.de/discussion/287/add-checkbox-to-not-delete-submitted-urls-from-directory-lists
It's in the project options: "no remove".