I have 2 Geeks from SolidSEO. One I use with GSA, and the other with Gscraper (on the Gscraper proxies).
Thus far no one has given me a hard time about it, but this thread got me thinking whether I should switch the VPS, the proxies, or both.
Well, has anyone figured out a good alternative to the Gscraper proxies? They are good for scraping, but cleaning the lists takes just about forever.
Where did that BHW proxy guy go? Can I purchase proxies from him? I want him to take my money!
By the way... For the Gscraper super users here, how much bang do you guys have on your VPS to effectively scrape, and actually get some clean lists out in decent time?
Did you test it recently? My scraping involves a lot of INURL scrapes and I found those proxies were better than the GScraper proxies for that purpose. He does a free trial, so I tested with that.
The speed in my test was 4x better than with the Gscraper proxies, but that could just be because Gscraper doesn't do INURL well.
Hmm... The way I tested it was that I took all the footprints from SER, then removed all the ones with fewer than 1M indexed results. I had like 100k or so keywords.
Then I let it run.
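(For anyone who wants to script that step: the footprint-plus-keyword setup is basically a cross join written to a file. A minimal sketch in Python, assuming the footprints and keywords live in plain one-per-line text files - the file names here are made up, not anything Gscraper ships with.)

```python
# Rough sketch: combine scraping footprints with a keyword list into a
# one-query-per-line file. "footprints.txt" and "keywords.txt" are
# placeholder names for plain text files with one entry per line.
def load_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

footprints = load_lines("footprints.txt")   # e.g. the footprints pulled from SER
keywords = load_lines("keywords.txt")       # the ~100k keyword list

with open("queries.txt", "w", encoding="utf-8") as out:
    for fp in footprints:
        for kw in keywords:
            out.write(f"{fp} {kw}\n")       # one query per line, ready to import

print(len(footprints) * len(keywords), "queries written")
```

With ~100k keywords, even a few hundred footprints balloon into tens of millions of queries, which is one reason people keep them in separate files per scrape.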
On my SolidSEO VPS (Geek) I was (and am) able to get a pretty consistent 6k-15k per minute scrape (sometimes as high as 30-60k/minute) on the Gscraper proxies.
The Red Proxies I got from BHW were delivering about 2k-4k per minute. Strange. The guy DID say that other users are scraping a lot more, so it might be some kind of VPS issue? In that case I don't understand why the Gscraper proxies work so well...
Overall, I find this topic super interesting! I'd like to be able to scrape my own lists, but I find that the cleaning part messes me up big time. I could (and did just recently) scrape 45M urls in a day and a half, but then Gscraper completely chokes on Geek if I even so much as try to import more than, say, 5M urls.
I think I've let it check the index count for something like 2 days straight now, and it's processed 25% of a 2.8M URL list. It's going to take me forever at this kind of speed.
So, very interested to hear other people's VPS setups for running Gscraper effectively.
I mean, do you really need these 3-figure VPSes to do it right? Or am I doing something wrong?
If you put an inurl: query like inurl:"wiki" into Gscraper, it gives you a warning that scraping is very slow for inurl.
I haven't tried the full service from Red Proxy so I can't tell you 100% that it is faster. But on the test I ran, Gscraper did 8,000/min and Red Proxy did 30,000/min.
Well, that's one of those mysteries of the universe that I don't understand.
The RP seller told me to use, what, 50 threads and a 60s timeout? Something like that?
With GS proxies, I've used 1500 threads and a timeout between 15 and 60. Whenever I even try to think how 50 and 1500 threads could be equal in power, my brain gets a blue screen and restarts. Then, when I'm back in this world, I just don't think about it anymore and I go with the 1500 threads.
Now, of course, it doesn't matter to me if I have to use 0.72 threads and a 1,440,000 timeout, as long as those settings give me even 10k+ / minute scraping...
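(If it helps un-blue-screen anyone else, there's a back-of-envelope ceiling hiding in those settings: each thread can only finish so many requests per minute, so threads times requests-per-thread caps the speed. A rough sketch below; the 100 results per query page is just an assumption, and real speeds sit far below the ceiling because most requests fail or time out.)

```python
# Back-of-envelope ceiling on scrape speed. Assumes each thread spends
# roughly avg_seconds_per_request per query and each successful query
# page returns ~100 results (both are rough assumptions).
def max_urls_per_min(threads, avg_seconds_per_request, results_per_page=100):
    requests_per_min = threads * 60 / avg_seconds_per_request
    return requests_per_min * results_per_page

print(max_urls_per_min(50, 60))     # 50 threads, 60s per request  -> 5,000 URLs/min ceiling
print(max_urls_per_min(1500, 15))   # 1500 threads, 15s per request -> 600,000 URLs/min ceiling
```

Which would line up roughly with what was reported above: 2k-4k/min at 50 threads is already close to that ceiling, while 6k-60k/min at 1500 threads is far under its ceiling, presumably because most of those requests never come back.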
@gooner, I didn't see you mention it in this thread, so may I ask what kind of VPS you are running Gscraper on? And do you find that you get even remotely that kind of speed when you're cleaning the lists you've scraped?
@Artsi - Oh good point, I didn't listen to his advice. I went right to 1000 threads and a 15-second timeout! lol
No way you can get that speed on 50 threads. But he told me 50 threads is just for the test, on the real service you get more proxies and can use 1000 threads or whatever.
I run GScraper on a dedicated server and you are absolutely right, after you clean the list it is much smaller. But it's all about volume; I scrape more than I can use on 2 dedicated servers, so I don't think it's a big problem.
Are you sure that's what he said? I subscribed to his service right away, and we exchanged emails back and forth, and he definitely told me to stick with 50 threads. Am I being sandboxed by a proxy provider now? Dang, this internet marketing is a pain-in-the-grass.
Okay, I need to ask him once more if he's made an error in the thread numbers or something.
All right. What I was thinking is, isn't the Geek - as an example - also a dedicated server? Just thinking...
Sure sure, I'm not that worried about the volume... As a matter of fact, the volume is not a problem at all to me personally. The problem is in having this tremendous volume of urls (like a full tanker) that I'm trying to shoot through a golf-ball sized hole. Takes forever to clean those lists... You think it might be a RAM or proxy issue or both? Or is there even more to this that I'm not yet seeing?
Thanks for the answers by the way @gooner. Really appreciate the help
It could be RAM, I guess. One thing I usually do is stop the scrape every 3-5 hours (or when I remember) and start it again. Then you get smaller lists to work with.
You could also set Gscraper so it does not scrape any dups, but that takes more memory to run.
@gooner - "It could be RAM i guess, one thing i usually i do is i stop the scrape every 3 - 5 hours (or when i remember) and make it start again. Then you get smaller lists to work with."
Why not just use the option to have gscraper split your lists while scraping? I have mine at 10 million per file and it has been working out great for me.
@Artsi, if I understand you correctly, you are checking all your scraped links to see if they're indexed in Google? If that's the case, why would you even bother, since you scraped them off Google and therefore they must be indexed already?
Guys using Red Proxy or other public proxy providers, is it not a total ball-ache switching proxies every 3-4 hours when they get updated? The reason I buy tools is to be as automated as possible, but this sounds like I'd need to babysit stuff, which I don't want to do.
I literally can't decide whether to buy GS or not - spent way too much on SEO toys this month and that's why I'm on the fence. Do you use those proxies just for scraping and dedis for posting on SER?
How much time do you spend on it a day? I appreciate, Gooner, that you're probably more likely to use it than others, since you're creating lists for yourself and to sell.
Why is it so much better than Scrapebox?
I'd like it if someone could tell me that they bought Gscraper, there's a bit of a learning curve, and then BAM, it's a one-click thing a day and you load the lists into SER.
Hey @judderman - Those proxies are updated twice daily, you can change proxies in Gscraper while it's running so it's just a couple of clicks twice per day.
Dedicated proxies would be burned out very quickly, I'd say; you can crank GS up to 1500 threads even on a mid-spec machine. That's the only reason I would pick GS instead of SB... Speed.
It's much faster, but it only scrapes Google, so I use SB as well just to maximize results.
EDIT: Gscraper has a free version, so you can give it a test run mate.
It's hard to do a side-by-side comparison because you can't run SB on the same number of threads as GS (SB uses too many resources).
SB is prone to errors, crashing etc too. I would guess an identical scrape might yield more results with SB but it would take 10 times longer, maybe more.
GS does yield a lot of dups, but you can set it not to scrape dups at all (uses more memory).
Put it this way... I could stop using SB tomorrow and it wouldn't make much difference. But GS is essential.
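(On the dups point: one way to take some of the pain out of cleaning is to dedupe the raw scrape outside GS/SB before doing anything else, either by full URL or by domain. A minimal sketch that streams the file line by line - the file names are placeholders; it still keeps the seen keys in memory, so deduping by domain is the lighter option on a huge list.)

```python
# Stream-dedupe a scraped URL list: read line by line, keep only the first
# occurrence of each key (full URL, or just the host with by_domain=True).
# File names are placeholders.
from urllib.parse import urlparse

def dedupe(in_path, out_path, by_domain=False):
    seen = set()
    kept = 0
    with open(in_path, encoding="utf-8", errors="ignore") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            url = line.strip()
            if not url:
                continue
            # fall back to the raw line if the URL has no scheme/host
            key = (urlparse(url).netloc.lower() or url) if by_domain else url
            if key not in seen:
                seen.add(key)
                dst.write(url + "\n")
                kept += 1
    return kept

print(dedupe("scraped_raw.txt", "scraped_unique_domains.txt", by_domain=True))
```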
@Artsi, I've previously checked a list of 65k URLs in Scrapebox to see whether they were indexed, and I had to repeat the process 4 times to get a good amount of them checked properly, because proxies kept dying and burning out even though I was using thousands of proxies I had just tested. I was still missing a fair amount of the total list after 4 tries. If you are only running your check once, that might explain why you're seeing what you are. Also, I don't know how Gscraper handles it when it encounters a proxy which is no good.
However, I am certain that everything you're scraping with Gscraper is indexed in Google. You couldn't find it with Gscraper if it wasn't indexed.
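(That matches the proxy-dying point above: any check that fails on a dead proxy just vanishes unless you re-run it. If you ever script the check yourself, the fix is to re-queue failures instead of counting them as "not indexed". A minimal sketch of that pattern; is_indexed() is a made-up stub here, since the actual check depends entirely on your tool and proxies.)

```python
# Re-queue pattern for index checking with flaky proxies: a URL only gets a
# verdict if the check actually completed; errors put it back in the queue.
# is_indexed() is a hypothetical stub - fill in however you do the check.
import random

def is_indexed(url, proxy):
    """Hypothetical: return True/False, or raise if the proxy fails."""
    raise NotImplementedError

def check_list(urls, proxies, max_passes=4):
    pending = list(urls)
    results = {}
    for _ in range(max_passes):
        if not pending:
            break
        still_failing = []
        for url in pending:
            try:
                results[url] = is_indexed(url, random.choice(proxies))
            except Exception:
                still_failing.append(url)   # proxy died - retry on the next pass
        pending = still_failing
    return results, pending                 # pending = never successfully checked
```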
Comments
Take a look on BHW
Dang, I need to finally go ahead and register at BHW. Been just hiding in the bushes thus far.
Thanks for the tip in any case!
PS: I also registered on BHW - finally.
inurl:"wiki"
If you put that into Gscraper, it gives you a warning that scraping is very slow for inurl.
I haven't tried the full service from Red Proxy so i can't say to you 100% it is faster.
But on the test i ran, Gscraper did 8,000/min and Red Proxy did 30,000/min
So, @gooner, when you start cleaning a list... How big of a chunk do you import into Gscraper at a time? 100k? 500k? 1M? Even more?
I'm definitely not going to import 5M at a time on this particular VPS anymore. It's just too much for it.
Maybe try to break your keywords (or footprints) into smaller chunks; I have mine in different files for different scrapes.
The idea is that the maximum I will ever scrape is around 1-2M. Gscraper handles that very well for me.
Hehe, well if 5M sounds like too much for your dedicated server, then maybe I shouldn't be mulling over such lists either.
All right. I'll just stop the index checking right now, export the 2.8M URLs and split them into 4 parts. Let's see if that brightens things up.
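(If anyone wants to script that split instead of doing it in a tool, a minimal sketch - the path and chunk size are placeholders; 700k lines per chunk cuts a 2.8M list into 4 parts.)

```python
# Split a big URL list into fixed-size chunks so the scraper/cleaner never
# has to swallow millions of lines at once. Path and chunk size are placeholders.
def split_file(in_path, lines_per_chunk=700_000):
    part, count, out = 0, 0, None
    with open(in_path, encoding="utf-8", errors="ignore") as src:
        for line in src:
            if out is None or count >= lines_per_chunk:
                if out:
                    out.close()
                part += 1
                out = open(f"{in_path}.part{part}.txt", "w", encoding="utf-8")
                count = 0
            out.write(line)
            count += 1
    if out:
        out.close()
    return part                      # number of chunk files written

print(split_file("scraped_2.8M_urls.txt"))
```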
And posting to un-indexed sites is a waste of time and resources.
Where the hell does it check the sites from, then?
Don't you guys check for the indexing? And why are there so many non-indexed sites if you actually do check it?
Strange...
I remember at some point something like 1M URLs turned into something like 20k. That got me veeeery frustrated!