
Scraping Using GSA SER and ScrapeBox

I need a little help from you scraping experts.
I want to scrape only wikis that have .edu or .gov in the URL. I am using GSA SER's default scraper. It has the option and footprints for wiki scraping, but how can I tweak it to scrape only .edu/.gov wikis? Does anyone have custom .edu/.gov wiki footprints? I also have ScrapeBox, but it allows only one footprint at a time. How can I use multiple footprints at a time for scraping in ScrapeBox? I am a noob at scraping, so every little bit of help will be appreciated. Thanks

Comments

  • shaun https://www.youtube.com/ShaunMarrs
    edited October 2016

    I would merge custom search modifiers into my current footprint queries. Just be aware that you will have to increase your timeout, because Google knows what you are doing with these custom modifiers and will soft-ban your proxies much quicker.

    For example, one of the MediaWiki footprints would become:

    "what links here" "related changes" "special pages" inurl:.edu

    And Google then returns only results with .edu in the URL.

    Do that for all of the footprints for both .edu and .gov and let ScrapeBox go (there is a sketch of the merge step at the end of this post). Try to build up your own set of custom footprints too; they help so much on the contextual side of things.

    Mess about with where you put the quotes and such, as that still returns some .coms and so on, but in the options pane of your SER project there is a tick box that lets you select which TLDs to post (or not post) to for that particular project.
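
    Here is a minimal sketch of that merge step in Python (the file names are hypothetical; it assumes your wiki footprints sit in a plain text file, one per line). It writes one finished query per line, and ScrapeBox's harvester accepts a pasted or imported list of queries like that, which gets around the one-footprint-at-a-time limit:

        # Merge wiki footprints with .edu/.gov URL modifiers to build scrape queries.
        # Assumes "wiki_footprints.txt" holds one footprint per line (hypothetical file).
        MODIFIERS = ["inurl:.edu", "inurl:.gov"]

        with open("wiki_footprints.txt", encoding="utf-8") as f:
            footprints = [line.strip() for line in f if line.strip()]

        # Pair every footprint with every modifier, one finished query per line.
        queries = [f"{fp} {mod}" for fp in footprints for mod in MODIFIERS]

        with open("edu_gov_queries.txt", "w", encoding="utf-8") as f:
            f.write("\n".join(queries))

        print(f"Wrote {len(queries)} queries from {len(footprints)} footprints.")

    Each output line ends up looking like the MediaWiki example above, just with every footprint/modifier combination covered.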

  • shaun Being a total noob, it took me 15 minutes to understand what you explained. :P
    But there is still one point of confusion: I want to use only footprints that scrape sites GSA SER can also identify.
  • shaun https://www.youtube.com/ShaunMarrs
    edited October 2016
    That's not possible to my knowledge, as Google returns whatever it feels best fits the query, so you are always going to have a fair amount of waste.

    I don't have the exact numbers to hand, but here is a rough breakdown of what you can probably expect when scraping your own contextual lists.

    Initial scrape - 100,000 targets.
    Identified by GSA PI - 50,000.
    Verified by GSA SER/CB - 200.
    Do follow - 100.
    Root Domain indexed - 70.

    As I said, I don't have the exact numbers, but I would say this is a fair representation of what to expect when building your own contextual lists. I remember when I first started doing it I expected the numbers to be much larger, but I don't remember ever getting more than 1% of a scrape as an actionable live link. That said, I am pretty strict about what I use.
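
    To put that in perspective, here is a quick back-of-the-envelope calculation with the illustrative figures above (they are rough guesses, not measurements):

        # Yield at each stage of the (illustrative) contextual scraping funnel.
        stages = [
            ("Initial scrape",         100_000),
            ("Identified by GSA PI",    50_000),
            ("Verified by GSA SER/CB",     200),
            ("Do follow",                  100),
            ("Root domain indexed",         70),
        ]

        total = stages[0][1]
        for name, count in stages:
            print(f"{name:<24} {count:>7,}  ({count / total:.2%} of the scrape)")
        # Final yield: 70 / 100,000 = 0.07%, comfortably under 1% of the harvest.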

    These numbers are part of the reason I have scrapped building my own lists and decided to just go back to loopline's service.

  • Why not just buy a list?
    There are a few reputable sellers here.