What are the criteria for scraping sources?

Hi Sven and other CG users!  I'd like to know what is doable and what isn't, in terms of scraping sources. I guess any page that has a search input will work as a search engine? What about other pages? What cannot be scraped at this time? Thanks, Deeeeeee

Comments

  • Sven www.GSA-Online.de
    1. For search engines, it uses any of the keywords defined in the project (IB column) to search for content; otherwise it takes the content of the given URL directly.
    2. It searches that page for a keyword and extracts the content around it. That's done for every place where the keyword is found, and the longest content is taken.
    3. A meaningful title is extracted from that page (<title>, <h1>, <h2>... whatever seems logical).
    4. The content is saved in the raw_data folder for later analysis.
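
For anyone who wants to picture steps 2 to 4, here is a rough Python sketch. It is not the tool's actual code, just one way the described behaviour could look. It starts from HTML that has already been fetched, so the search engine step is left out. Treating each block-level element as one piece of "content around the keyword" is an assumption, and the function names and the file naming inside raw_data are made up for illustration.

```python
import re
from html import unescape
from pathlib import Path

TAG_RE = re.compile(r"<[^>]+>")
# Assumption: block-level tags mark the boundaries of a piece of content.
BLOCK_SPLIT_RE = re.compile(
    r"</?(?:p|div|li|td|br|h[1-6]|article|section)\b[^>]*>", re.IGNORECASE
)


def strip_tags(fragment):
    """Crudely drop tags, decode entities and collapse whitespace."""
    return re.sub(r"\s+", " ", unescape(TAG_RE.sub(" ", fragment))).strip()


def longest_keyword_block(html, keyword):
    """Step 2: collect every block that mentions the keyword, keep the longest."""
    # Ignore head, script and style so only visible page text is considered.
    body = re.sub(r"<(script|style|head)\b.*?</\1>", " ", html,
                  flags=re.IGNORECASE | re.DOTALL)
    blocks = (strip_tags(b) for b in BLOCK_SPLIT_RE.split(body))
    hits = [b for b in blocks if keyword.lower() in b.lower()]
    return max(hits, key=len, default="")


def extract_title(html):
    """Step 3: take the first non-empty <title>, <h1> or <h2>, in that order."""
    for tag in ("title", "h1", "h2"):
        m = re.search(rf"<{tag}\b[^>]*>(.*?)</{tag}>", html,
                      flags=re.IGNORECASE | re.DOTALL)
        if m and strip_tags(m.group(1)):
            return strip_tags(m.group(1))
    return ""


def save_raw(title, content, folder="raw_data"):
    """Step 4: write the content to the raw_data folder (file name is hypothetical)."""
    path = Path(folder)
    path.mkdir(parents=True, exist_ok=True)
    name = re.sub(r"[^A-Za-z0-9_-]+", "_", title).strip("_") or "untitled"
    target = path / f"{name}.txt"
    target.write_text(content, encoding="utf-8")
    return target
```

Calling `longest_keyword_block(page_html, "apple")` and `extract_title(page_html)` on a downloaded page, then passing both results to `save_raw`, would mirror the flow described above. A page with no keyword match simply yields an empty string.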