
Help with custom source setups

I'm testing different custom content sources, but I'm not sure how best to add them for what I want.

For example, if I want a bunch of articles on diabetes, I go to the CDC website, which has tons of related content.

I've tried adding the following URLs to my custom sources section, and I have the keyword 'diabetes' enabled to search with:

https://search.cdc.gov/search/

Some of those URLs give back some diabetes content, but even for the ones that do, there's also tons of irrelevant stuff it's going through.

If I start on their homepage, https://www.cdc.gov/, there is a search box...if I search for diabetes, it takes me to their search subdomain referenced above, https://search.cdc.gov, with all the content on diabetes.

But if I just use https://cdc.gov as a content source, it just starts scraping the entire site, which covers tons of stuff other than diabetes.

How would I structure this to pull the right data from this site?

I'm having similar issues with WebMD, which has a similar setup to the CDC.

Also, what does the 'extract links up to level' function really mean? If I have it set to 2, for example, will it go to
https://search.cdc.gov/search/?query=diabetes&dpage=1 and then https://search.cdc.gov/search/?query=diabetes&dpage=2 and then stop there, or am I way off?

Thanks


Comments

  • Sven (www.GSA-Online.de)
    This is special, as the searches are actually performed by JavaScript calls and JSON delivery. I'll make a custom search engine for it in the next update.
  • Sven said:
    This is special, as the searches are actually performed by JavaScript calls and JSON delivery. I'll make a custom search engine for it in the next update.
    That'd be awesome, thanks...same for WebMD if possible.

    I'd still like to get a better understanding of how to add sites like these and other custom sources myself. I have a lot of authority sites...some have search functions like that, some don't (should I just add those as article sites directly via their homepage?).

    Where would I add the %search% placeholder? Wherever the keyword shows up in the URL after I do a manual search for it, or do I just append it to the end no matter what? How does that work?

    It doesn't matter so much for sites that are very specific to one topic, but for mixed ones, I don't want to be scraping the entire site when most pages will be irrelevant.

    And I'm still confused about the 'extract links up to level' function. Should I just set that to the max of 9999 for big sites that are specifically niche-relevant?
  • Sven (www.GSA-Online.de)
    Just have a look in the appdata folder at scraper\article\*.ini << there you find them all, and the "language" to add new sources is actually really easy to understand.

    Also, what's WebMD?
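As background for the %search% question above, here's a minimal sketch of how a keyword placeholder in a source URL is typically expanded. The assumption (mine, not the tool's documented behavior) is that %search% goes wherever the typed keyword appears in the URL after a manual search, and each enabled keyword is substituted into that spot:

```python
from urllib.parse import quote_plus

# Hypothetical source template: take the URL the site shows after a
# manual search and replace the typed keyword with %search%.
template = "https://search.cdc.gov/search/?query=%search%&dpage=1"

def expand(template: str, keyword: str) -> str:
    # URL-encode the keyword so multi-word terms stay valid in a URL.
    return template.replace("%search%", quote_plus(keyword))

print(expand(template, "diabetes"))
# → https://search.cdc.gov/search/?query=diabetes&dpage=1
```

Appending the keyword to the end would only work by accident; substituting it where the site itself puts the query parameter is the safer reading.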
  • Sven (www.GSA-Online.de)
    9999 seems a bit out of focus, really. Think of the level as the number of times you would click on sublinks to reach the article. It doesn't seem practical to use 9999 clicks.
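The "clicks" analogy can be sketched as a breadth-first crawl: level 0 is the start URL, level 1 is everything linked from it, and so on. A toy illustration of just the level logic, over an in-memory link graph standing in for real pages (an actual crawler would fetch each page and extract its links):

```python
from collections import deque

# Toy link graph: each page maps to the pages it links to.
links = {
    "start": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": ["e"],
}

def crawl(start: str, max_level: int) -> set[str]:
    """Collect every page reachable within max_level 'clicks' of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, level = queue.popleft()
        if level == max_level:
            continue  # don't follow links beyond the configured depth
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, level + 1))
    return seen

print(crawl("start", 2))  # reaches a, b, c, d but not e (3 clicks away)
```

So setting the level to 2 in the dpage example means two link-follows from the starting URL, not specifically paging through dpage=1 and dpage=2.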
  • googlealchemist (Anywhere I want)
    Sven said:
    Just have a look in the appdata folder at scraper\article\*.ini << there you find them all, and the "language" to add new sources is actually really easy to understand.

    Also, what's WebMD?
    Sven said:
    9999 seems a bit out of focus, really. Think of the level as the number of times you would click on sublinks to reach the article. It doesn't seem practical to use 9999 clicks.
    Thanks, I'll check out that language.

    And I'll adjust the clicks...maybe a few dozen makes more sense? I was trying to think less like a personal user who would click through and read a few articles, and more like a data aggregator that wants to pull all of the relevant information out of the entire site.

    WebMD is one of the biggest, if not the biggest, medical/health-related websites...maybe it's more of a USA-specific thing vs Germany.
  • googlealchemist (Anywhere I want)
    Sven said:
    9999 seems a bit out of focus, really. Think of the level as the number of times you would click on sublinks to reach the article. It doesn't seem practical to use 9999 clicks.
    If the domain I am scraping is 100% dedicated to the niche I want content for, what's the best way to scrape the entire site? If it's not setting a large number of clicks...could there be an option to just crawl the entire site and pull content, or go through the sitemap if it has one, or what would you recommend?


  • Sven (www.GSA-Online.de)
    You can use an RSS or sitemap URL instead and it would go through that list.
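A sitemap is just an XML list of <url><loc> entries, so "going through that list" amounts to extracting the loc values and scraping each one. A minimal sketch using Python's standard library, parsing a sample document inline (a real run would download the sitemap from the URL you supply first):

```python
import xml.etree.ElementTree as ET

# Sample sitemap content; a real scraper would fetch this from
# something like https://example.com/sitemap.xml.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/diabetes/basics</loc></url>
  <url><loc>https://example.com/diabetes/prevention</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list[str]:
    # Pull every <loc> value out of the sitemap namespace.
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", NS)]

print(sitemap_urls(SITEMAP))
```

Because the list is explicit, there's no crawl depth to tune; you only visit the pages the site itself enumerates.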
  • googlealchemist (Anywhere I want)
    Sven said:
    You can use an RSS or sitemap URL instead and it would go through that list.
    Awesome, thanks, I'll try that for the RSS feeds that I have.

    Is there any potential for CG to look for a feed or sitemap automatically from the root, vs me having to find them manually and add them? Different sites have different permutations of both.
  • Sven (www.GSA-Online.de)
    I can add that in the next update, I guess.
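In the meantime, the auto-discovery idea can be approximated by hand: robots.txt often advertises sitemap locations via "Sitemap:" lines, and feeds/sitemaps otherwise tend to live at a few conventional paths. A sketch of that lookup logic, using a sample robots.txt rather than a live fetch (the paths in COMMON_PATHS are conventions, not guarantees):

```python
from urllib.parse import urljoin

ROBOTS = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

# Conventional fallback locations to try when robots.txt lists nothing.
COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/feed", "/rss.xml"]

def discover(root: str, robots_txt: str) -> list[str]:
    """Return candidate sitemap/feed URLs for a site root."""
    # Prefer explicit "Sitemap:" declarations in robots.txt.
    found = [line.split(":", 1)[1].strip()
             for line in robots_txt.splitlines()
             if line.lower().startswith("sitemap:")]
    if found:
        return found
    # Otherwise fall back to guessing the usual locations.
    return [urljoin(root, p) for p in COMMON_PATHS]

print(discover("https://example.com/", ROBOTS))
```

A full version would fetch each candidate and keep the ones that return valid XML, but the discovery order above is the core of it.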
  • Sven (www.GSA-Online.de)
    The latest update has added improved scraping features for sitemaps and RSS feeds.