You mentioned anonymity level. I have zero knowledge about proxies, so I'm curious why it matters that you scrape with SOCKS proxies.
What's the difference compared to using the others? I'm basically scraping directories for expired domains; that's the purpose.
A proxy that leaks your real IP is considered a transparent proxy. That's something you do not want, because websites might still block you once they see connections from different proxies carrying the same real IP in a header.
So with SOCKS-only proxies you can be sure the proxy does not leak your IP that way, since the SOCKS protocol does not insert extra data into the headers you send.
@antonearn - Sites will often try to protect themselves from scraping for a few reasons:
- They don't want their content stolen (and possibly republished elsewhere)
- Scraping takes up bandwidth from them.
- Scraping, if done aggressively enough, can take their website down.
So they use some measures to avoid clearly automated "attacks." One of those measures is IP blocking, meaning they will temporarily block IPs that seem a bit too aggressive.
Many sites have no protection in place, but others might use Cloudflare, or even some security policies in their web server software, to protect themselves. That's where proxies come in. If you constantly change the IP of your connection, they can't block you, because they can't tell which IP is you and which IP is a new visitor.
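As a rough illustration of "constantly changing the IP," here is a minimal Python sketch that cycles through a proxy list round-robin using only the standard library. The proxy addresses are made-up placeholders from a documentation IP range, and the function names are hypothetical; treat it as one way to do rotation, not the only one.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses (documentation range, not real proxies).
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def rotating_openers(proxy_urls):
    """Yield an endless round-robin sequence of (proxy_url, opener) pairs."""
    for proxy_url in itertools.cycle(proxy_urls):
        handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        yield proxy_url, urllib.request.build_opener(handler)

def fetch_all(urls, proxy_urls, timeout=10):
    """Fetch each URL through the next proxy in the rotation."""
    rotation = rotating_openers(proxy_urls)
    pages = {}
    for url in urls:
        proxy_url, opener = next(rotation)
        with opener.open(url, timeout=timeout) as response:
            pages[url] = response.read()
    return pages
```

Each request goes out through a different proxy, so from the site's side consecutive requests appear to come from different visitors.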
Proxy Anonymity
Now, some proxies leak your originating IP. There are different levels of proxy anonymity:
- Transparent proxies give the server your IP.
- Anonymous (Level 2) proxies hide your IP, but the server can tell the connection is coming from a proxy server. It just doesn't know what IP is behind that proxy.
- Elite (Level 1) proxies hide your IP and provide no hint that a proxy server is involved in the visit at all.
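If you want to check which level a proxy falls into, one common approach is to send a request through it to a page that echoes back the headers it received, then look at them. The sketch below covers only the classification step. The header list is an assumption based on headers proxies commonly add, not an exhaustive or official set, and `classify_proxy` is a hypothetical helper name.

```python
# Headers that commonly reveal a proxy is involved (assumed, not exhaustive).
PROXY_HEADERS = {"via", "x-forwarded-for", "forwarded", "x-real-ip", "proxy-connection"}

def classify_proxy(received_headers, real_ip):
    """Classify a proxy from the headers the target server saw.

    received_headers: dict of header name -> value as seen by the server.
    real_ip: your own public IP address as a string.
    """
    normalized = {name.lower(): value for name, value in received_headers.items()}
    # Simplification: a plain substring match is used to spot the real IP.
    if any(real_ip in value for value in normalized.values()):
        return "transparent"  # the server can see your real IP
    if any(name in PROXY_HEADERS for name in normalized):
        return "anonymous"    # IP hidden, but a proxy is evident
    return "elite"            # no sign of a proxy at all

print(classify_proxy({"Via": "1.1 someproxy"}, "198.51.100.7"))
```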
SOCKS vs HTTP
This is a big subject to explain and it's pretty technical, so here's a good resource if you want to learn about the differences:
http://ghostproxies.com/blog/2016/04/difference-http-socks-proxies/
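For a quick feel of the difference, here is a small Python sketch of what a client sends to each kind of proxy. The SOCKS5 byte layout follows RFC 1928 (the initial method-negotiation step is left out for brevity); the function names and `example.com` are just for illustration.

```python
import struct

def socks5_connect_request(host, port):
    """Bytes asking a SOCKS5 proxy to open a TCP connection to host:port."""
    name = host.encode("ascii")
    # VER=5, CMD=1 (CONNECT), RSV=0, ATYP=3 (domain name), then the
    # length-prefixed host and a 2-byte big-endian port.
    # There is no field here for HTTP headers.
    return b"\x05\x01\x00\x03" + bytes([len(name)]) + name + struct.pack(">H", port)

def http_proxy_request(host, path="/"):
    """Plain-text request a client sends to an HTTP proxy for http://host/path."""
    # The proxy parses this text and forwards it, so it is able to add or
    # rewrite headers (for example Via or X-Forwarded-For) on the way through.
    return f"GET http://{host}{path} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")

print(socks5_connect_request("example.com", 80))
print(http_proxy_request("example.com"))
```

A SOCKS proxy just relays the bytes once the connection is open, while an HTTP proxy understands the request and can modify it.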
Scraping Directories
To answer your question: if you're scraping directories, you might be able to get away with not using proxies. But if the directory is using any sort of protection, they'll probably temporarily block you, and then you'll have to use proxies. I generally use proxies for everything, but that's not necessarily the right way to do things; that's just how I do it. The trade-off with proxies is speed (proxies are slower than connecting directly), and public proxies often suck and burn out quickly because so many other people are using them.
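If you'd rather start without proxies and only fall back to them when blocked, a sketch along these lines might work. It assumes the site signals a block with HTTP 403 or 429, which is common but not guaranteed, and the helper names are hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: the site signals a temporary block with one of these status codes.
BLOCK_STATUSES = {403, 429}

def make_opener(proxy_url=None):
    """Build an opener that connects directly (None) or through an HTTP proxy."""
    # An empty dict tells ProxyHandler to ignore any system-wide proxy settings.
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else {}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))

def fetch_with_fallback(url, proxy_urls, timeout=10, opener_factory=make_opener):
    """Try a direct connection first, then each proxy in turn if that fails."""
    for proxy_url in [None, *proxy_urls]:
        opener = opener_factory(proxy_url)
        try:
            with opener.open(url, timeout=timeout) as response:
                return proxy_url, response.read()
        except urllib.error.HTTPError as error:
            if error.code not in BLOCK_STATUSES:
                raise  # a real error, not a block
            # Blocked on this route: fall through and try the next one.
        except OSError:
            continue  # dead or slow route (common with public proxies)
    raise RuntimeError(f"all routes blocked or unreachable for {url}")
```

The function returns the proxy that worked (or `None` for a direct connection) along with the page body, so you can see how often the fallback is actually needed.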