
Help with custom source setups

I'm testing different custom content sources, but I'm not sure how best to add them for what I want.

For example, if I want a bunch of articles on diabetes, I go to the CDC website, which has tons of related content.

I've tried adding the following URLs to my custom sources section, with the keyword 'diabetes' enabled to search with:

https://search.cdc.gov/search/

Some of those URLs return some diabetes content, but even the ones that do also go through tons of irrelevant pages.

If I start on their homepage, https://www.cdc.gov/, there is a search box. If I search for diabetes, it takes me to their search subdomain referenced above, https://search.cdc.gov, with all the content on diabetes.

But if I just use https://cdc.gov as a content source, it starts scraping the entire site, most of which is about things other than diabetes.

How would I structure this to pull the right data from this site?

I'm having similar issues with WebMD, which has a similar setup to the CDC's.

Also, what does the 'extract links up to level' function really mean? If I have it set to 2, for example, will it go to
https://search.cdc.gov/search/?query=diabetes&dpage=1 and then https://search.cdc.gov/search/?query=diabetes&dpage=2 and then stop there, or am I way off?

Thanks


Comments

  • Sven www.GSA-Online.de
    This is a special case, as the searches are actually performed by JavaScript calls with JSON delivery. I'll make a custom search engine for it in the next update.
  • googlealchemist Anywhere I want
    Sven said:
    This is a special case, as the searches are actually performed by JavaScript calls with JSON delivery. I'll make a custom search engine for it in the next update.
    That'd be awesome, thanks. Same for WebMD if possible.

    I'd still like to get a better understanding of how to add sites like these and other custom sources myself. I have a lot of authority sites. Some have search functions like that, some don't (I'll just add those as article sites directly via their homepage?).

    Where would I add the %search% thing? Wherever the keyword shows up in the URL after I do a manual search for it? Or do I append it to the end no matter what? How does that work?

    It doesn't matter so much for sites that are very specific to one topic, but for ones that are mixed, I don't want to be scraping every page of the site when most will be irrelevant.

    And I'm still confused about the 'extract links up to level' function. Should I just set that to the max of 9999 for big sites that are specifically niche relevant?
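To illustrate the %search% question above, here is a minimal Python sketch of how a placeholder-based search URL could be filled in. It assumes %search% is swapped for the URL-encoded keyword wherever it appears in the URL; the %page% placeholder, the function name, and the template are hypothetical and are not taken from the tool's actual .ini syntax.

```python
from urllib.parse import quote_plus


def build_search_urls(template, keyword, pages=1):
    """Fill a search URL template for each result page.

    %search% is assumed to be replaced by the URL-encoded keyword.
    %page% is a hypothetical placeholder used only in this sketch.
    """
    encoded = quote_plus(keyword)
    urls = []
    for page in range(1, pages + 1):
        url = template.replace("%search%", encoded).replace("%page%", str(page))
        urls.append(url)
    return urls


# The template mirrors the URL seen after a manual search on the site:
# the keyword's position in that URL is where the placeholder goes.
TEMPLATE = "https://search.cdc.gov/search/?query=%search%&dpage=%page%"
urls = build_search_urls(TEMPLATE, "diabetes", pages=2)
for url in urls:
    print(url)
```

The idea is to run one manual search, copy the resulting URL, and put the placeholder where the keyword appeared, rather than appending it to the end regardless of the site's URL layout.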
  • Sven www.GSA-Online.de
    Just have a look in the appdata folder under scraper\article\*.ini. You'll find them all there, and the "language" used to add new sources is actually really easy to understand.

    Also, what's WebMD?
  • Sven www.GSA-Online.de
    9999 seems excessive, really. Think of the level as the number of sublinks you would click through to reach the article. It doesn't seem practical to use 9999 clicks.
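The "level as clicks" idea can be sketched as a depth-limited, breadth-first link walk. This is only an illustration in Python: the function name, the toy SITE link map, and the choice to count the start page as level 0 are assumptions, not the tool's confirmed behavior.

```python
from collections import deque


def links_up_to_level(start, get_links, max_level):
    """Return {page: level} for every page within max_level clicks of start.

    The start page is counted as level 0 here, which is an assumption.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        level = seen[page]
        if level == max_level:
            continue  # do not follow links any deeper than the limit
        for link in get_links(page):
            if link not in seen:
                seen[link] = level + 1
                queue.append(link)
    return seen


# Hypothetical site structure: each page maps to the links found on it.
SITE = {
    "home": ["topics", "about"],
    "topics": ["diabetes", "flu"],
    "diabetes": ["article-1", "article-2"],
    "about": [],
    "flu": [],
}

found = links_up_to_level("home", lambda page: SITE.get(page, []), 2)
print(found)
```

With a limit of 2 in this toy map, the walk reaches the "diabetes" page but not the articles linked from it, since those are three clicks from the start. A small level is enough when the start URL is already a search results page, while a huge value such as 9999 would in practice let the walk cover every page reachable from the start page.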