GSA PROXY SCRAPER: Does it matter which proxy types you scrape with?
I wonder if it matters what proxy types you use when scraping URLs with Scrapebox for expired domains?
Which one works best etc?
Comments
What's the difference compared to using the others? I'm basically scraping directories for expired domains; that's the purpose.
So by using only SOCKS proxies, you can be sure that your IP is not leaked through the request headers, because the SOCKS protocol does not insert extra data (such as an X-Forwarded-For or Via header) into the headers you send.
Websites generally don't want to be scraped, for a few reasons:
- They don't want their content stolen (and possibly republished elsewhere)
- Scraping takes up bandwidth from them.
- Scraping, if done aggressively enough, can take their website down.
So they use some measures to avoid clearly automated "attacks." One of those measures is IP blocking. Meaning, they will temporarily block IPs that seem a bit too aggressive.
Many sites have no protection in place, but others might use Cloudflare, or even some security policies in their web server software, to protect themselves. That's where proxies come in. When you constantly change the IP of your connection, they can't block you because they can't tell which IPs are you and which are new visitors.
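To make the rotation idea concrete, here's a minimal Python sketch that assigns a different proxy to each request in round-robin order. The proxy addresses and URLs are made-up placeholders (documentation-reserved IPs), and the helper names are mine, not part of Scrapebox or GSA. Nothing here actually hits the network; it just shows how the pairing works.

```python
import itertools
import urllib.request

# Hypothetical proxies (documentation-reserved IPs), not real servers.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def build_opener_for(proxy_url):
    """Return a urllib opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

# Placeholder URLs to scrape.
urls = [f"http://example.com/page{i}" for i in range(1, 5)]

# itertools.cycle repeats the proxy list forever, so each request
# gets the next proxy and the list wraps around when it runs out.
rotation = itertools.cycle(PROXIES)
plan = [(url, next(rotation)) for url in urls]

for url, proxy in plan:
    opener = build_opener_for(proxy)
    # opener.open(url, timeout=10) would perform the real request.
    print(url, "via", proxy)
```

With three proxies and four URLs, the fourth request wraps around to the first proxy again. In practice you'd also want to drop proxies that time out.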
Proxy Anonymity
Now, some proxies leak your originating IP. There are different levels of proxy anonymity. Transparent proxies give the server your IP. Anonymous (Level 2) proxies hide your IP, but the server can tell the connection is coming from a proxy server; it just doesn't know the IP behind that proxy. Elite (Level 1) proxies hide your IP and give no hint that a proxy server is involved in the visit at all.
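If you want to see how a checker might tell these levels apart, here's a rough Python sketch. It looks at the headers a test server received and sorts the proxy into one of the three levels. The header list is my assumption of the usual giveaways (real proxy checkers look at more), and the function name and IPs are made up for illustration.

```python
# Headers that commonly reveal a proxy is involved (assumed list, not exhaustive).
PROXY_HEADERS = {"via", "x-forwarded-for", "forwarded", "x-real-ip", "proxy-connection"}

def classify_anonymity(received_headers, real_ip):
    """Classify a proxy from the headers the target server received.

    received_headers: dict of header name -> value, as seen by the server.
    real_ip: your own IP address, which the proxy should be hiding.
    """
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in received_headers.items()}

    # Transparent: your real IP shows up somewhere in the headers.
    # (Simple substring match, good enough for a sketch.)
    if any(real_ip in value for value in normalized.values()):
        return "transparent"

    # Anonymous: IP is hidden, but proxy-related headers are present.
    if any(header in normalized for header in PROXY_HEADERS):
        return "anonymous"

    # Elite: no sign of your IP or of a proxy.
    return "elite"

my_ip = "198.51.100.7"  # placeholder IP
print(classify_anonymity({"X-Forwarded-For": my_ip, "Via": "1.1 proxy"}, my_ip))
print(classify_anonymity({"Via": "1.1 proxy"}, my_ip))
print(classify_anonymity({"User-Agent": "Mozilla/5.0"}, my_ip))
```

The three calls cover a transparent, an anonymous, and an elite proxy, in that order.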
SOCKS vs HTTP
This is kind of a big subject as far as explaining and it's pretty technical, so here's a good resource if you want to learn about the differences:
http://ghostproxies.com/blog/2016/04/difference-http-socks-proxies/
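As a quick illustration of why SOCKS doesn't touch your headers: in SOCKS5 (RFC 1928), the client only tells the proxy which host and port to connect to, and everything after that is passed through as raw bytes. Here's a small Python sketch that builds that connect request. The function name is mine, and this leaves out the initial greeting/authentication step and the actual socket handling.

```python
import struct

def socks5_connect_request(host, port):
    """Build a SOCKS5 CONNECT request for a domain name (RFC 1928, section 4)."""
    host_bytes = host.encode("idna")
    return (
        b"\x05"  # protocol version 5
        b"\x01"  # command: CONNECT
        b"\x00"  # reserved byte
        b"\x03"  # address type: domain name
        + bytes([len(host_bytes)])  # one-byte length of the domain name
        + host_bytes
        + struct.pack(">H", port)   # port, two bytes, big-endian
    )

request = socks5_connect_request("example.com", 80)
print(request)
```

Notice there's no place in that message for HTTP headers. The proxy never parses your HTTP request, so it has nothing to add an X-Forwarded-For or Via header to.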
Scraping Directories
To answer your question: if you're scraping directories, you might be able to get away with not using proxies. But if the directory is using any sort of protection, they'll probably temporarily block you, and then you'll have to use proxies. I generally use proxies for everything, but that's not necessarily the right way to do things; that's just how I do it. The trade-off with proxies is speed (proxies are slower than connecting directly), and if you're using public proxies, they often suck and burn out quickly because so many other people are using them.
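If you'd rather only pay the speed penalty when you have to, one option is to try a direct connection first and fall back to proxies only when you look blocked. Here's a minimal Python sketch of that idea. Treating HTTP 403 and 429 as "blocked" is my assumption (some sites show a captcha page with a 200 instead), and `fetch` is a stand-in for whatever function actually makes your requests.

```python
# Status codes treated as "we got blocked" (an assumption; sites vary).
BLOCKED_STATUSES = {403, 429}

def fetch_with_fallback(url, fetch, proxies):
    """Try a direct request first; if it looks blocked, retry through each proxy.

    fetch(url, proxy) must return an HTTP status code; proxy=None means direct.
    Returns (status, proxy_used), where proxy_used is None for a direct hit.
    """
    used = None
    status = fetch(url, used)
    for proxy in proxies:
        if status not in BLOCKED_STATUSES:
            break  # the last attempt worked, stop here
        used = proxy
        status = fetch(url, used)
    return status, used

# Fake fetch for demonstration: direct is rate-limited, the first proxy is banned.
def fake_fetch(url, proxy):
    responses = {None: 429, "http://203.0.113.10:8080": 403}
    return responses.get(proxy, 200)

result = fetch_with_fallback(
    "http://example.com/directory",
    fake_fetch,
    ["http://203.0.113.10:8080", "http://203.0.113.11:8080"],
)
print(result)
```

In the demo, the direct request and the first proxy both get blocked, so the second proxy ends up serving the page.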