Issues Harvesting GSA List with ScrapeBox

Malice, Los Angeles, CA
Hey there,

Long time watcher, first time poster. Anyway, for the last while I have relied primarily on scraping Bing with semi-dedicated proxies... I know, I know, but it worked for what I needed it for.

I decided to take the next step last night and invest in some US dedicated proxies from BuyProxies. I only purchased ten, which I figured should be enough for what I need them for: ScrapeBox harvesting.

As recommended by the brilliant minds here in this forum, I decided to play it safe and go single-threaded posting. However, I am only able to harvest a handful of results before the proxies stop working, and I am not really using them for anything else. I tried using the new dedicated proxies by themselves last night, and they keeled over almost immediately. I contacted the provider and they switched the proxies out for a different batch. So I added a 10-second delay and tossed in 10 of my semi-dedicated proxies from another provider (which actually harvested more than the dedicated proxies from BuyProxies). Anyway, I am guessing I noob F-ed up somewhere in the harvesting settings.

I searched for a solution but didn't have much luck. This forum seems incredibly knowledgeable about everything, so I figured I would ask.

Can anyone take a gander and let me know if I am doing something wrong? Or any advice on what I should do?

Scrapebox Settings Screenshot


Thanks a bunch!
Malice

Comments

  • edited October 2015
    Okay, I'm probably not the ScrapeBox expert you were looking for (you most certainly want @loopline's take on this one), but I know a thing or two.

    First of all, you mention scraping from Bing, but I'm guessing you're now trying to scrape Google with the semi-dedicated proxies. Nothing wrong with that, except that you have to be extra careful, as you already know. You should definitely set a delay for Google of at least 30 seconds in Settings > Harvester Engines Configuration.

    From the settings you have shown, I would make the following changes:
    Timeouts > Harvester > 60 (maybe even 30) - you don't need 120 since you're using private proxies.
    Connections > Harvester > 2-4 (play with that) - since you mentioned 10 proxies.

    Harvester Proxy Retries>OFF
    Harvester Proxy Timeout>OFF
    Harvester Error Retries>OFF
    Harvester Max Redirect>OFF

    You don't need those. You would mostly need them if you were using public proxies.

    Also keep in mind that some footprints, like those using inurl or "Powered by" etc., can trigger the captcha protection on your proxy very easily. The captcha protection is only temporary and goes away after a few hours; it doesn't mean your proxy is banned. So your proxy will still show as "passed" when you test it, even though it returns a captcha on every search and produces errors in ScrapeBox.
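    By the way, the combination of a per-proxy delay and a small connection count is really just rate limiting. Here's a minimal sketch of the idea in generic Python (not ScrapeBox internals; the proxy strings and the 30-second delay are just placeholders):

    ```python
    import itertools
    import time

    class ProxyRotator:
        """Round-robin proxy pool that enforces a minimum delay per proxy,
        i.e. the 'delay between queries on the same proxy' idea above."""

        def __init__(self, proxies, min_delay=30.0):
            self.cycle = itertools.cycle(proxies)
            self.min_delay = min_delay
            self.last_used = {p: 0.0 for p in proxies}  # never-used -> 0.0

        def next_proxy(self):
            """Return the next proxy, sleeping first if it was used too recently."""
            proxy = next(self.cycle)
            wait = self.min_delay - (time.monotonic() - self.last_used[proxy])
            if wait > 0:
                time.sleep(wait)
            self.last_used[proxy] = time.monotonic()
            return proxy

    # With 10 proxies and a 30s per-proxy delay, the pool as a whole can
    # still issue roughly one query every 3 seconds without hammering
    # any single proxy.
    pool = ProxyRotator(["proxy%d:8080" % i for i in range(10)], min_delay=30.0)
    ```

    The point is that the delay protects each individual proxy, while the number of connections decides how fast you burn through the pool.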

    I hope I was helpful.
  • Malice, Los Angeles, CA
    Thanks, Spiritfly. I am going to go ahead and give that a go. If it doesn't work, I guess I will pester BuyProxies and see if I can't get them changed out.

    Thanks!
    Malice
  • I rely on public proxies for scraping, so I don't have recent experience with private ones. If I recall correctly, loopline recommended a ratio of 16 proxies per thread, or even more, some time ago.
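    If that 16:1 figure is right, the implied thread cap is simple arithmetic; a quick hypothetical helper just to show the math:

    ```python
    def max_threads(num_proxies, proxies_per_thread=16):
        """Conservative thread cap from the 16-proxies-per-thread rule of thumb.
        Floors to at least 1 so a small pool still gets one connection."""
        return max(1, num_proxies // proxies_per_thread)
    ```

    By that measure, a pool of 10 dedicated proxies only supports a single harvesting connection, which lines up with the single-threaded advice above.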

    You may want to check out The Art of Harvesting from loopline on YouTube, and check out his ScrapeBox forum. There is a myriad of information about harvesting there.
