Hello, people of the world on the GSA forum. I hope you're OK if you're somewhere that's locked down because of Covid-19.
Anyway, I was wondering if it might be worthwhile to do something like:
As GSA Proxy Scraper (PS) runs, and individual proxies are found to be working or not working with different search engines (SEs) while scraping with SER, serve only those proxies that have NOT yet been shown to fail with a given SE. So, the default would be to serve a proxy unless it has been specifically banned for that SE.
I know this might slow things down a lot, so I'm just sharing it for fun.
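To make the "serve unless banned" default concrete, here's a rough Python sketch. It is not how PS or SER work internally (I don't know that); `ProxyPool`, `ban`, and `candidates` are names I made up for illustration.

```python
class ProxyPool:
    """Hypothetical pool: a proxy is served for any SE unless banned for it."""

    def __init__(self, proxies):
        self.proxies = list(proxies)
        # proxy -> set of search engines it has been seen failing on
        self.banned = {}

    def ban(self, proxy, engine):
        """Record that `proxy` should no longer be served for `engine`."""
        self.banned.setdefault(proxy, set()).add(engine)

    def candidates(self, engine):
        """Return every proxy that has NOT been banned for `engine`."""
        return [p for p in self.proxies
                if engine not in self.banned.get(p, ())]


pool = ProxyPool(["1.1.1.1:80", "2.2.2.2:8080"])
pool.ban("1.1.1.1:80", "google")
print(pool.candidates("google"))  # only the proxy not banned for google
print(pool.candidates("bing"))    # both proxies, nothing banned for bing
```

An untested proxy is treated the same as a working one here, which matches the "default is to serve" idea.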
Maybe SER could use a second port paired with each assigned working port. Over that second port, SER would tell PS which SE it is trying to scrape through the first port, and whether the scrape succeeded. So, one port to use as usual, and a second to keep track of SEs, so PS can set a bit for each SE and store that with the other characteristics of each proxy.
At the same time, if a proxy starts to fail on scraping with a certain SE (where it has either been marked as working before or was never tested against that SE), the second port could be used to tell PS that that particular proxy is no longer good for that SE.
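Here's a rough Python sketch of what a report on that second channel might look like, and how PS could turn it into one bit per SE. Everything in it is my assumption: the JSON line format, the `ENGINE_BITS` table, and the function names are made up, and clearing the bit on a later success is just one possible policy.

```python
import json

# Hypothetical bit assignments, one bit per search engine.
ENGINE_BITS = {"google": 1, "bing": 2, "yahoo": 4}


def encode_report(proxy, engine, ok):
    """What SER might send on the second port: one JSON object per line."""
    return json.dumps({"proxy": proxy, "engine": engine, "ok": ok}) + "\n"


def apply_report(blocked, line):
    """What PS might do on receipt: update the proxy's blocked-SE bitmask."""
    msg = json.loads(line)
    bit = ENGINE_BITS[msg["engine"]]
    mask = blocked.get(msg["proxy"], 0)
    if msg["ok"]:
        mask &= ~bit   # assumption: a success clears the blocked bit
    else:
        mask |= bit    # a failure sets the blocked bit for that SE
    blocked[msg["proxy"]] = mask
    return mask


blocked = {}
apply_report(blocked, encode_report("1.2.3.4:8080", "google", False))
apply_report(blocked, encode_report("1.2.3.4:8080", "bing", False))
print(blocked["1.2.3.4:8080"] & ENGINE_BITS["google"] != 0)  # blocked on google?
```

The bitmask is tiny, so storing it next to the other proxy characteristics should cost almost nothing. The real cost would be the extra traffic on the second port.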
Of course, as PS tests proxies against SEs, that data can be used as well.
It's just that SER gets valuable data every time a proxy fails/shows as blocked on a certain SE, so why waste that?
Like other functions, proxies could be set to be removed after 1, 2, or some other number of fails.
This second channel for communicating which SE the current proxy is scraping could also be combined with the tests PS runs against SEs, whether started by the user or automatically. Again, the data is there, and PS could decide more accurately which proxies to serve for which SEs.
Sven, is this doable? Is it worthwhile? Would it slow things down immensely? Is the loss in speed worth the gain of not serving proxies that have been found to be unusable, either by a PS test or during actual SER project runs?
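The fail threshold could be as simple as a counter per (proxy, SE) pair. A rough Python sketch, with made-up names (`FailureTracker`, `max_fails`), and assuming a success resets the count, which is my guess and not anything PS does today:

```python
from collections import Counter


class FailureTracker:
    """Hypothetical counter: block a proxy for an SE after `max_fails` fails."""

    def __init__(self, max_fails=2):
        self.max_fails = max_fails
        self.fails = Counter()  # (proxy, engine) -> consecutive fail count

    def record(self, proxy, engine, ok):
        """Record one scrape result; return True if the pair is now blocked."""
        key = (proxy, engine)
        if ok:
            self.fails.pop(key, None)  # assumption: a success resets the count
        else:
            self.fails[key] += 1
        return self.is_blocked(proxy, engine)

    def is_blocked(self, proxy, engine):
        return self.fails[(proxy, engine)] >= self.max_fails


tracker = FailureTracker(max_fails=2)
tracker.record("1.2.3.4:8080", "google", ok=False)
print(tracker.is_blocked("1.2.3.4:8080", "google"))  # one fail, limit is two
tracker.record("1.2.3.4:8080", "google", ok=False)
print(tracker.is_blocked("1.2.3.4:8080", "google"))  # second fail reaches the limit
```

Counting per SE means a proxy that is blocked on one engine can still be served for the others.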
Have a great weekend, everyone.
Hmm... I guess for this to work, SER would also have to inform PS when a port is being used for scraping, and if so, for which SE, so that only proxies that have not failed or been blocked on that SE are served.