Scraping Using GSA SER and ScrapeBox
I need a little help from the scraping experts here.
I want to scrape only wikis that have .edu or .gov in the URL. I am using the GSA SER default scraper. It has the option and footprints for wiki scraping, but how can I tweak it to scrape only .edu and .gov wikis? Does anyone have custom .edu/.gov wiki footprints? I also have ScrapeBox, but it allows only one footprint at a time. How can I use multiple footprints for scraping in ScrapeBox at once? I am a noob at scraping, so every little bit of help will be appreciated. Thanks
Comments
I would merge my current footprints with custom search modifiers in my queries. Just be aware you will have to increase your timeout, because Google knows what you are doing with these custom modifiers and will soft-ban your proxies much quicker.
For example, one of the MediaWiki footprints would become...
"what links here" "related changes" "special pages" inurl:.edu
And it returns these results.
Do that for all of the footprints for .edu and .gov and let ScrapeBox go. Try to build up your own set of custom footprints too; they help so much on the contextual side of things.
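If you don't fancy editing every footprint by hand, here is a rough Python sketch of the merge step. It just pairs each footprint with each modifier and prints one query per line. The function name `merge_footprints` and the one-item footprint list are made up for illustration; swap in your own footprint list, and treat the output as a plain keyword list to feed into whichever scraper you use.

```python
from itertools import product


def merge_footprints(footprints, modifiers):
    """Pair every footprint with every modifier, skipping blanks and duplicates."""
    seen = set()
    queries = []
    for footprint, modifier in product(footprints, modifiers):
        footprint = footprint.strip()
        if not footprint:
            continue  # ignore empty lines from a footprint file
        query = f"{footprint} {modifier}"
        if query not in seen:
            seen.add(query)
            queries.append(query)
    return queries


# Illustrative input: the MediaWiki footprint from this thread.
footprints = ['"what links here" "related changes" "special pages"']
modifiers = ["inurl:.edu", "inurl:.gov"]

merged = merge_footprints(footprints, modifiers)
for query in merged:
    print(query)
```

With a longer footprint list you get two queries per footprint, so expect the extra load on your proxies to scale the same way.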
Mess about with where you put the quotes and such, as that query still returns some .coms and so on. In the options pane of your SER project there is a tick box that lets you select which TLDs to post or not post to for that particular project.
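If you would rather strip the stray .coms from a harvested list yourself, something like this Python sketch would do it. It is only an assumption of how you might filter: `keep_edu_gov` is a made-up helper, the URLs are placeholders, and it checks the hostname ending rather than the whole URL, so a .com page with ".edu" somewhere in its path gets dropped. Note it expects full URLs with http:// or https://, and it will not match country variants like .edu.au unless you add them to `suffixes`.

```python
from urllib.parse import urlparse


def keep_edu_gov(urls, suffixes=(".edu", ".gov")):
    """Keep only URLs whose hostname ends with one of the given suffixes."""
    kept = []
    for url in urls:
        url = url.strip()
        # hostname is None when the URL has no scheme, so fall back to "".
        host = urlparse(url).hostname or ""
        if host.endswith(suffixes):
            kept.append(url)
    return kept


# Placeholder URLs for illustration only.
harvested = [
    "http://wiki.example.edu/index.php?title=Special:RecentChanges",
    "http://example.com/wiki/.edu-page",
    "https://wiki.example.gov/Main_Page",
]

filtered = keep_edu_gov(harvested)
for url in filtered:
    print(url)
```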
There are a few reputable sellers here.