
Help with custom source setups

I'm testing different custom content sources, but I'm not sure how best to add them for what I want.

For example, if I want a bunch of articles on diabetes, I go to the CDC website, which has tons of related content.

I've tried adding the following URLs to my custom sources section, and I have the keyword 'diabetes' enabled to search with:

https://search.cdc.gov/search/

Some of those URLs give back some diabetes content, but even for the ones that do, there's also tons of irrelevant stuff it's going through.

If I start on their homepage, https://www.cdc.gov/, there is a search box...if I search for diabetes, it takes me to their search subdomain referenced above, https://search.cdc.gov, with all the content on diabetes.

But if I just use https://cdc.gov as a content source, it just starts scraping the entire site, which covers tons of stuff other than diabetes.

How would I structure this to pull the right data from this site?

I'm having similar issues with WebMD, which has a similar setup to the CDC.

Also, what does the 'extract links up to level' function really mean? If I have it set to 2, for example, will it go to
https://search.cdc.gov/search/?query=diabetes&dpage=1 and then https://search.cdc.gov/search/?query=diabetes&dpage=2 and then stop there, or am I way off?

Thanks


Comments

  • Sven (www.GSA-Online.de)
    This is special, as the searches are actually performed by JavaScript calls and JSON delivery. I'll make a custom search engine for it in the next update.
  • Sven said:
    This is special, as the searches are actually performed by JavaScript calls and JSON delivery. I'll make a custom search engine for it in the next update.
    That'd be awesome, thanks...same for WebMD if possible.

    I'd still like to get a better understanding of how to add sites like these and other custom sources myself. I have a lot of authority sites...some have search functions like that, some don't (should I just add those as article sites directly via their homepage?).

    Where would I add the %search% placeholder? Wherever the keyword shows up in the URL after I do a manual search for it, or do I just append it to the end no matter what? How does that work?

    It doesn't matter so much for sites that are very specific to one topic, but for mixed ones, I don't want to be scraping the entire site when most pages will be irrelevant.

    And I'm still confused about the 'extract links up to level' function. Should I just set that to the max of 9999 for big sites that are specifically niche-relevant?
  • Sven (www.GSA-Online.de)
    Just have a look in the appdata folder at scraper\article\*.ini << there you find them all, and the "language" to add new sources is actually really easy to understand.

    Also, what's WebMD?
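As background for the %search% question above, here's a minimal sketch of how a keyword placeholder in a source URL is typically expanded. The assumption (mine, not the tool's documented behavior) is that %search% goes wherever the typed keyword appears in the URL after a manual search, and each enabled keyword is substituted into that spot:

```python
from urllib.parse import quote_plus

# Hypothetical source template: take the URL the site shows after a
# manual search and replace the typed keyword with %search%.
template = "https://search.cdc.gov/search/?query=%search%&dpage=1"

def expand(template: str, keyword: str) -> str:
    # URL-encode the keyword so multi-word terms stay valid in a URL.
    return template.replace("%search%", quote_plus(keyword))

print(expand(template, "diabetes"))
# → https://search.cdc.gov/search/?query=diabetes&dpage=1
```

Appending the keyword to the end would only work by accident; substituting it where the site itself puts the query parameter is the safer reading.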
  • Sven (www.GSA-Online.de)
    9999 seems a bit out of focus, really. Think of the level as the number of times you would click on sublinks to reach the article. It doesn't seem practical to use 9999 clicks.
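The "clicks" analogy can be sketched as a breadth-first crawl: level 0 is the start URL, level 1 is everything linked from it, and so on. A toy illustration of just the level logic, over an in-memory link graph standing in for real pages (an actual crawler would fetch each page and extract its links):

```python
from collections import deque

# Toy link graph: each page maps to the pages it links to.
links = {
    "start": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": ["e"],
}

def crawl(start: str, max_level: int) -> set[str]:
    """Collect every page reachable within max_level 'clicks' of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, level = queue.popleft()
        if level == max_level:
            continue  # don't follow links beyond the configured depth
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, level + 1))
    return seen

print(crawl("start", 2))  # reaches a, b, c, d but not e (3 clicks away)
```

So setting the level to 2 in the dpage example means two link-follows from the starting URL, not specifically paging through dpage=1 and dpage=2.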
  • googlealchemist (Anywhere I want)
    Sven said:
    Just have a look in the appdata folder at scraper\article\*.ini << there you find them all, and the "language" to add new sources is actually really easy to understand.

    Also, what's WebMD?
    Sven said:
    9999 seems a bit out of focus, really. Think of the level as the number of times you would click on sublinks to reach the article. It doesn't seem practical to use 9999 clicks.
    Thanks, I'll check out that language.

    And I'll adjust the clicks...maybe a few dozen makes more sense? I was trying to think less like a personal user who would click through and read a few articles, and more like a data aggregator that wants to pull all of the relevant information out of the entire site.

    WebMD is one of the biggest, if not the biggest, medical/health-related websites...maybe it's more of a USA-specific thing vs Germany.
  • googlealchemist (Anywhere I want)
    Sven said:
    9999 seems a bit out of focus, really. Think of the level as the number of times you would click on sublinks to reach the article. It doesn't seem practical to use 9999 clicks.
    If the domain I am scraping is 100% dedicated to the niche I want content for, what's the best way to scrape the entire site? If it's not setting a large number of clicks...could there be an option to just crawl the entire site and pull content, or go through the sitemap if it has one, or what would you recommend?


  • Sven (www.GSA-Online.de)
    You can use an RSS or sitemap URL instead and it would go through that list.
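A sitemap is just an XML list of <url><loc> entries, so "going through that list" amounts to extracting the loc values and scraping each one. A minimal sketch using Python's standard library, parsing a sample document inline (a real run would download the sitemap from the URL you supply first):

```python
import xml.etree.ElementTree as ET

# Sample sitemap content; a real scraper would fetch this from
# something like https://example.com/sitemap.xml.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/diabetes/basics</loc></url>
  <url><loc>https://example.com/diabetes/prevention</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list[str]:
    # Pull every <loc> value out of the sitemap namespace.
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", NS)]

print(sitemap_urls(SITEMAP))
```

Because the list is explicit, there's no crawl depth to tune; you only visit the pages the site itself enumerates.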
  • googlealchemist (Anywhere I want)
    Sven said:
    You can use an RSS or sitemap URL instead and it would go through that list.
    Awesome, thanks, I'll try that for the RSS feeds that I have.

    Is there any potential for CG to look for a feed or sitemap automatically from the root, vs me having to find them manually and add them? Different sites have different permutations of both.
  • Sven (www.GSA-Online.de)
    I can add that in the next update, I guess.
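In the meantime, the auto-discovery idea can be approximated by hand: robots.txt often advertises sitemap locations via "Sitemap:" lines, and feeds/sitemaps otherwise tend to live at a few conventional paths. A sketch of that lookup logic, using a sample robots.txt rather than a live fetch (the paths in COMMON_PATHS are conventions, not guarantees):

```python
from urllib.parse import urljoin

ROBOTS = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

# Conventional fallback locations to try when robots.txt lists nothing.
COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/feed", "/rss.xml"]

def discover(root: str, robots_txt: str) -> list[str]:
    """Return candidate sitemap/feed URLs for a site root."""
    # Prefer explicit "Sitemap:" declarations in robots.txt.
    found = [line.split(":", 1)[1].strip()
             for line in robots_txt.splitlines()
             if line.lower().startswith("sitemap:")]
    if found:
        return found
    # Otherwise fall back to guessing the usual locations.
    return [urljoin(root, p) for p in COMMON_PATHS]

print(discover("https://example.com/", ROBOTS))
```

A full version would fetch each candidate and keep the ones that return valid XML, but the discovery order above is the core of it.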
  • Sven (www.GSA-Online.de)
    The latest update has added improved scraping features for sitemaps and RSS feeds.