So, it's been about a month since I got into this whole "scraping business."
I'm running GScraper on one of the SolidSEO VPSes: the Geek plan.
Now... I've been able to come up with cleaned lists of around 30-60k URLs, even a couple of 200k ones.
I have some questions for people running just these scraped lists through SER.
1) What kind of verified percentages do you typically see?
2) What footprints / keywords do you use?
3) What's your particular process of cleaning the lists?
I'll share my answers here...
1) Well, I'm just running my first two scraped lists (I think they were something like 30k and 60k in size). After running them for a while, I'm seeing around 10% verified on one list and 25% on the other.
2) For these, I just grabbed all the footprints from SER and threw in the 428k keyword list I have. I'm planning to scrape with more niche-specific keywords and more focused footprints, e.g. only social bookmarks.
3) I first remove duplicate URLs and duplicate domains. I then fetch the HTTP status for each URL and write it into the title field, after which I remove every entry whose title contains any of these: 3, 4, 5, "the" (i.e. drop everything that isn't a 2xx response).
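For anyone who wants to reproduce that cleaning step outside GScraper, here's a minimal Python sketch of the same idea: dedupe by URL and by domain, then keep only URLs that return a 2xx status. All names here are my own, and the status checker is injected as a callable so you can plug in whatever HTTP client (or GScraper export) you actually use.

```python
from urllib.parse import urlparse

def dedupe(urls):
    """Remove exact duplicate URLs, then keep only one URL per domain."""
    seen_urls, seen_domains, out = set(), set(), []
    for url in urls:
        domain = urlparse(url).netloc.lower()
        if url in seen_urls or domain in seen_domains:
            continue
        seen_urls.add(url)
        seen_domains.add(domain)
        out.append(url)
    return out

def keep_alive(urls, status_of):
    """Keep only URLs whose HTTP status is 2xx.

    `status_of` is a callable returning the status code for a URL
    (e.g. a HEAD request) -- injected so the filter itself needs no
    network access and is easy to test.
    """
    return [u for u in urls if 200 <= status_of(u) < 300]
```

This mirrors the "remove titles starting with 3, 4, 5" trick by checking the numeric status code directly instead of going through the title field.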
Right now it seems like SER really isn't eating up those lists at all... CPU is maxed out even on version 8.0 (there's probably some sort of bug with running imported lists in the latest SER).