Scraping General Q: How Can I Get Better Results?
I have tried scraping a few times, and I've always been discouraged by the results, especially the fact that so many of the sites returned are super-authority sites, leaving few real targets to post links on.
Is it normal to get URLs that don't fit what you're looking for?
I have noticed this every time I've attempted to scrape targets. Right now I'm using some GSA SER footprints plus a keyword, but even years ago when I tried scraping, the same thing happened.
Just need some guidance. Thanks, all....
Comments
Are you scraping too many sites that don't work with the software? If so, you need to adjust your footprints. Maybe test each footprint manually in a search engine before running it through your scraping bot.
Or is your scrape just not returning any sites at all?
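For context, "footprints plus a KW" just means the two strings concatenated into one search query. Here's a minimal sketch of what the scraper actually sends, using an illustrative footprint and keywords (not GSA SER's exact ones):

```python
# Build the queries a scraper sends: footprint + space + keyword.
# The footprint and keywords are illustrative examples only.
footprint = '"Powered by Question2Answer"'
keywords = ["fitness", "gardening", "travel"]

queries = [f"{footprint} {kw}" for kw in keywords]
for q in queries:
    print(q)  # paste any of these into a search engine to test manually
```

If a query like that returns mostly huge authority domains, the footprint is too generic and needs tightening.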
There is no way to target just authority sites. You have to go through the process of scraping and testing to build the site list. Some will be authority sites, but most will be DR0-DR20 sites.
These can still have value, but you'll need to run extra campaigns to boost their DR before they become genuinely useful. Just running tiers with your site list will eventually boost the authority of every site on it.
You'll need to test each main footprint first, one by one. If the bare footprint yields no results, then adding a keyword on the end will also yield zero results.
You'll eventually filter out the non-working footprints and end up with a list of footprints that do work. So don't be disheartened by the lack of sites.
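Here's a minimal sketch of that filtering loop, assuming a search() helper that returns result URLs for a query - the stub below is a placeholder, so wire it up to whatever scraper or search API you actually use:

```python
# Keep only the footprints that still return results.
# search() is a stub standing in for your real scraper / search API call.
def search(query: str) -> list[str]:
    # Replace with a real call; an empty list simulates a dead footprint.
    fake_index = {'"Powered by Question2Answer"': ["https://example.com/qa"]}
    return fake_index.get(query, [])

footprints = [
    '"Powered by Question2Answer"',
    '"Powered by XpressEngine"',
]

working = []
for fp in footprints:
    # Test the bare footprint on its own first: if it returns nothing,
    # footprint + keyword will return nothing too, so skip it.
    if search(fp):
        working.append(fp)
    else:
        print(f"dead footprint, skipping: {fp}")

print("working footprints:", working)
```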
For article sites, the only engines you'll still find sites for are:
gnu board
dwqa
osclass
classipress
bbpress
wp foro
buddypress
moodle
joomla k2
drupal
question 2 answer
xpress engine
If it's not on that list, you won't find sites for it. It's a similar situation for forum, social network and wiki sites - only some of those engines still have working sites available. You'll just have to test each one and see what results you get. You only need to test it once - if you don't scrape anything with the footprint, then move on to the next one.
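If it helps, this is how I'd seed that test-once pass: one candidate footprint per engine from the list above. These footprints are illustrative guesses, not necessarily the exact ones GSA SER ships with, so check them against the software's footprint files and a manual search first:

```python
# One candidate footprint per engine - illustrative only, verify
# against GSA SER's own footprint files before relying on them.
engine_footprints = {
    "osclass": '"Powered by Osclass"',
    "classipress": '"Powered by ClassiPress"',
    "bbpress": '"Powered by bbPress"',
    "wp foro": '"Powered by wpForo"',
    "drupal": '"Powered by Drupal"',
    "question 2 answer": '"Powered by Question2Answer"',
    "xpress engine": '"Powered by XpressEngine"',
}

for engine, fp in engine_footprints.items():
    print(f"{engine}: test once -> {fp}")
```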
That's why I invested in the new sernuke engines. The git alikes package yields the most sites - 2000+. No other engines in the software will yield that many sites anymore.
3 years ago we used to get thousands of gnu board sites - not anymore - it's in the hundreds now.
7+ years ago we had thousands of joomla k2 sites - not anymore, I'm down to 3 sites lol. Google have made it impossible to scrape these with footprints - they've blocked them.
These engines are all public link sources - they will go through a cycle of being spammed to death and eventually site owners abandon their sites.
Different search engines will also behave differently - as you mentioned with Bing, I see similar things happening with seznam, yandex, aol and duck duck go. They have different search operators, which are worth researching further. What works in Google won't necessarily work in other search engines.
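A small sketch of that idea: keep one query template per engine, since operator support differs. The operator mappings below are assumptions for illustration (e.g. Google honours inurl:, while Yandex has historically used its own url: operator), so verify them against each engine's current documentation before automating:

```python
# Render one footprint into engine-specific queries.
# Operator support varies; these templates are illustrative assumptions.
footprint = '"Powered by bbPress"'
url_fragment = "forums"

templates = {
    "google": "{fp} inurl:{frag}",     # Google supports inurl:
    "yandex": "{fp} url:*{frag}*",     # Yandex's url: operator (assumption)
    "duckduckgo": "{fp} {frag}",       # safest fallback: no operator at all
}

for engine, tpl in templates.items():
    print(engine, "->", tpl.format(fp=footprint, frag=url_fragment))
```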
That's why you should test the footprint manually first. Only once you've confirmed it works should you automate it. Just keep testing and adjusting your strategies. Sounds like you're on the right track, so keep at it.
@sickseo - is it that Google has blocked joomla k2 sites, or has Joomla made some changes?