Internal Proxy Server
Tools like ScrapeBox cannot handle CONNECT proxies, so you will see different results when using them in that software.
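For anyone unsure what the difference is at the wire level, here is a rough sketch of the two request styles an HTTP proxy can receive. The function names are made up for illustration; only the request formats themselves are standard HTTP.

```python
def plain_proxy_request(url: str, host: str) -> bytes:
    """Regular HTTP proxy usage: the client sends the full URL in the request line."""
    return f"GET {url} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")


def connect_request(host: str, port: int) -> bytes:
    """CONNECT usage: the client asks the proxy to open a raw tunnel to host:port."""
    target = f"{host}:{port}"
    return f"CONNECT {target} HTTP/1.1\r\nHost: {target}\r\n\r\n".encode("ascii")


if __name__ == "__main__":
    print(plain_proxy_request("http://example.com/", "example.com").decode())
    print(connect_request("example.com", 443).decode())
```

A client that only speaks the first style cannot use a proxy that only accepts the second, which is one plausible reason for results differing between tools.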
1. Internal Proxy Server
In GSA Proxy Scraper you have the option to enable its own internal proxy server in options (off by default). Once you have it running, it will allow you to use the proxy with IP 127.0.0.1 and Port 8080 (default values are changeable) in any other software. So adding 127.0.0.1:8080 as a proxy will allow other tools (not only GSA software) to make use of all proxies within GSA Proxy Scraper.
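As a minimal sketch of what "adding 127.0.0.1:8080 as a proxy" looks like from a script's point of view, the following uses Python's standard urllib. The helper names (proxy_settings, make_opener, fetch) are hypothetical, and it assumes the internal server is enabled on the default address from the post.

```python
import urllib.request

# Defaults taken from the post; adjust if you changed them in options.
PROXY_HOST = "127.0.0.1"
PROXY_PORT = 8080


def proxy_settings(host: str = PROXY_HOST, port: int = PROXY_PORT) -> dict:
    """Return a mapping in the form urllib's ProxyHandler expects.

    Only plain HTTP is mapped here. HTTPS through a proxy relies on
    CONNECT tunneling, which this sketch does not assume is available.
    """
    return {"http": f"http://{host}:{port}"}


def make_opener(host: str = PROXY_HOST, port: int = PROXY_PORT):
    """Build an opener that sends HTTP requests through the local proxy."""
    handler = urllib.request.ProxyHandler(proxy_settings(host, port))
    return urllib.request.build_opener(handler)


def fetch(url: str, timeout: float = 15.0) -> bytes:
    """Fetch a URL via the local proxy; raises if the proxy is not running."""
    with make_opener().open(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch("http://example.com/") would only succeed while the internal proxy server is actually running.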
- - - - - - - - - - - - - -
@Sven,
So best settings for ScrapeBox seem to be:
- Use Internal Proxy server
- Uncheck CONNECT proxies
- Insert server IP into ScrapeBox
- Have GSA re-scan / test every X minutes and remove bad proxies
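The last item in the list above (re-test every X minutes and drop bad proxies) can be pictured roughly like this. It is only an illustration of the idea, not how GSA Proxy Scraper does it internally; all names are hypothetical, and the TCP check only shows that a port is open, not that the proxy actually works.

```python
import socket
import time


def tcp_check(proxy: str, timeout: float = 5.0) -> bool:
    """Very rough liveness check: can we open a TCP connection to ip:port?"""
    host, _, port = proxy.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        return False


def retest(proxies, is_alive=tcp_check):
    """Keep only the proxies that pass the supplied check."""
    return [p for p in proxies if is_alive(p)]


def run_schedule(proxies, interval_minutes, rounds, is_alive=tcp_check, sleep=time.sleep):
    """Re-test the list `rounds` times, waiting `interval_minutes` between passes."""
    for round_number in range(rounds):
        proxies = retest(proxies, is_alive)
        if round_number < rounds - 1:
            sleep(interval_minutes * 60)
    return proxies
```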
Uses:
The proxies will be used for scraping domains: for example, loading the internal links of a lot of pages, then afterwards loading the external links in an attempt to find domains.
Filters:
Do not accept anonymous (no elite) proxies? Do non-elite proxies have a higher potential of leaking my IP? I'm running this from my home ISP, so I want to make sure I don't get any phone calls asking if I'm running a "botnet". Or are regular anonymous proxies safe because they also don't send your real IP?
Do not accept transparent proxies (CHECK)
Skip suspicious proxies (CHECK)
Skip the following IP-Ranges (CHECK) - I'm in the USA, don't want to take any risks with hitting a honeypot.
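For reference, the usual way these three levels are told apart is by looking at what the target server receives. Below is a rough sketch using the commonly used definitions; how GSA Proxy Scraper or ScrapeBox actually classify proxies is not stated in this thread, and the header list and function name are assumptions.

```python
import re

# Headers that commonly reveal a proxy is in use (illustrative, not exhaustive).
PROXY_HINT_HEADERS = ("via", "x-forwarded-for", "forwarded", "x-real-ip", "proxy-connection")


def classify_proxy(received_headers: dict, real_ip: str) -> str:
    """Classify a proxy from the headers the target server saw.

    transparent: your real IP shows up somewhere in the headers
    anonymous:   real IP is hidden, but headers reveal a proxy is used
    elite:       neither your IP nor any proxy hint is visible
    """
    normalized = {name.lower(): str(value) for name, value in received_headers.items()}
    for value in normalized.values():
        # Split into tokens so "1.2.3.4" does not match inside "11.2.3.45".
        if real_ip in re.split(r"[^0-9A-Za-z:.]+", value):
            return "transparent"
    if any(name in normalized for name in PROXY_HINT_HEADERS):
        return "anonymous"
    return "elite"
```

Under these definitions a regular anonymous proxy does not pass along your real IP; it only gives away that a proxy is being used.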
- - - - - - - - - - - - - -
Should I use NoBlock mode? SB + GSA Proxy Scraper are running on a brand new i5-8400 machine with Windows 7.
It's not necessary to "Set proxy in browser" since I'm doing no web browsing on this PC, right?
- - - - - - - - - - - - - -
Lastly, during my trial of GSA Proxy Scraper, I did notice that ScrapeBox had a lot of "This proxy leaks your IP". However, I didn't read about the CONNECT proxies until just now. Is this what was causing this error?
Thanks a lot!!
Comments
Hmm well that's not very promising. GSA PS just crashed after doing a small test scrape with SB, now it's re-testing all of the 60k proxies I imported, and saved no data about previous testing... Sent bug report with my email.
Also, I'm having a really weird bug that's happened multiple times now. The program seems to have some "memory glitch" (memory leak?) where it's frozen in a state of not being able to do anything. I vaguely remember GSA SER having this issue way back in the day. For example, if I press "Quit" on GSA PS while it's in this "memory glitch" state, it will exit the threads but refuse to quit the program.
I have to ctrl+alt+delete, and upon re-opening, it has seemingly forgotten the state of the proxies and draws from a previous state (but the options are saved).
2) I'd rather use the Internal Proxy Server with ScrapeBox because this system will be running pretty much 24/7 and I don't want to (and am not able to) be constantly testing/importing proxies into SB. I have GSA PS running at 100-150 threads, and SB link extractor running at 100-150 threads. This seems to be a good balance between acquiring new proxies and SB being able to work fine (when some URLs error out, I just export and re-check them).
3) The proxies aren't used for scraping at all -- I have some Google proxies I use for acquiring targets to use the SB link-extractor, which are already filtered by TF, etc... and the GSA PS proxies are pretty much only being used for extracting internal/external links from websites in order to find expired domains.
4) I did notice some Bing-tagged proxies (even though I don't have bing checked) but they were automatically disabled by GSA PS when SB tried to use them.
5) Yes, it does seem to be a "memory glitch". And it seems very common; it's already happened 3 times within hours of using GSA PS. I remember way back in 2012-2013 GSA had a similar problem. You'd be running a lot of threads, then once they were done, you'd try to end the threads, but they wouldn't stop. Sometimes I remember having to ctrl+alt+delete and all my links would be gone, the "previous state" as mentioned above. I'm not sure if the error report I sent in is related; that was just a hard crash/close. This "memory bug" only happens after running the program for some time.
For example, just a minute ago, I could tell it was happening, because I finished the SB threads... then went to delete some failed proxies in GSA PS, and it wouldn't delete them. I pressed "Delete Highlighted" and it did nothing. So I'm like, "yea this is the memory glitch". I press "Quit", it closes the threads, and doesn't exit the program. That's an example of how/when it happens, although I haven't used it enough to notice any particular pattern that causes it. (I have also now set up automatic export as a fail-safe.)
It seems to happen whether I'm actively using the Internal server or not (it's happened when SB wasn't even open). I did notice that when the Quit works properly, it says "Next schedule 60 minutes," whereas if the glitch happens, the threads exit without the scheduler being able to assign a new time to check the proxies next. This time I saved the proxies to a file, thankfully, because it once again restored a previous "state" that had 300 fewer proxies.
- - - - - - - - - - - -
Also, if I select "Use search engines to locate proxy lists" and then insert some keywords like "free proxy list" etc., does GSA automatically know how to extract the proxies from the pages? Or is it best to manually search myself and add in new sources?
Sadly though, when it happens, I press "Quit" to restart GSA PS, and it exits threads but doesn't quit. I have to ctrl alt delete, and then add + re-test the proxies I manually exported.
I also seemingly found the limit of the software. Basically I run the software at 100 threads and SB at 200-300, and it seems to work. Any higher, and things do not end up too well!
Anyways, great piece of software. I've also just purchased Proxy Buddy and think it will be a nice complement to GSA PS.
Is the number of threads you set in GSA PS relevant to the speed at which the Internal proxy server can process / send new proxies? Because when I'm testing it, if it's not searching for new proxies, the threads stay at "0" even if I'm running 100 threads on SB.
I've started now to just quit using it entirely, but I really hate that I have to do that, because allowing it to use the CONNECT proxies results in dramatically higher success rates for certain things (likely because not every other SB user is using it).
Sometimes I'll look at the options in GSA PS and see the proxies switching crazy fast... other times, it'll just be stuck on 1 proxy forever, and this is when I know the program has likely died. Is there a way to make the proxy server faster, to consume more CPU, to be able to perform better?
- - - - -
Also, does running GSA + CB in addition to PS + SB somehow affect the Internal proxy server? I have the "server" option on CB completely disabled... but I notice when trying to run SB + PS and GSA + CB, it seems to break the internal proxy server, or maybe it was just coincidence...?