Help with custom source setups
googlealchemist
Anywhere I want
I'm testing different custom content sources, but I'm not sure how best to add them for what I want.
For example, if I want a bunch of articles on diabetes, I go to the CDC website, which has tons of related content.
I've tried adding the following URLs to my custom sources section, with the keyword 'diabetes' enabled to search with:
https://search.cdc.gov/search/
Some of those URLs return some diabetes content, but even with the ones that do, it's also going through tons of irrelevant stuff.
If I start on their homepage, https://www.cdc.gov/, there is a search box. If I search for diabetes, it takes me to the search subdomain referenced above, https://search.cdc.gov, with all the content on diabetes.
But if I just use https://cdc.gov as a content source, it starts scraping the entire site, most of which is about things other than diabetes.
How would I structure this to pull the right data from this site?
I'm having similar issues with WebMD, which has a similar setup to the CDC.
Also, what does the 'extract links up to level' function really mean? If I have it set to 2, for example, will it go to
https://search.cdc.gov/search/?query=diabetes&dpage=1 and then https://search.cdc.gov/search/?query=diabetes&dpage=2 and then stop there, or am I way off?
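To make the level question concrete, here is a minimal Python sketch of how I'd assume a depth limit usually works in a crawler: level 1 is every link found on the start page, level 2 is every link found on those pages, and so on. This is only my guess, not how the tool actually implements it. `crawl_to_level`, `get_links`, and the toy `SITE` graph are made up for illustration.

```python
from collections import deque

def crawl_to_level(start_url, get_links, max_level):
    """Breadth-first walk following links at most max_level hops from start_url.

    get_links is a hypothetical callable: url -> list of URLs linked from that page.
    Returns {url: level}, where the start page itself is level 0.
    """
    seen = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        level = seen[url]
        if level >= max_level:
            continue  # pages at the limit are kept, but their links are not followed
        for link in get_links(url):
            if link not in seen:
                seen[link] = level + 1
                queue.append(link)
    return seen

# Toy link graph standing in for real pages (made-up URLs, no network access).
SITE = {
    "search?q=diabetes&page=1": ["article-a", "article-b", "search?q=diabetes&page=2"],
    "search?q=diabetes&page=2": ["article-c", "search?q=diabetes&page=3"],
    "article-a": ["about-us"],
    "search?q=diabetes&page=3": ["article-d"],
}

result = crawl_to_level("search?q=diabetes&page=1", lambda u: SITE.get(u, []), 2)
for url, level in result.items():
    print(level, url)
```

Under that assumption, level 2 would not simply mean "page 1, then page 2". It would reach result page 2 and its articles, but it would also wander into unrelated links such as `about-us`, and it would never reach `article-d` on result page 3.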
Thanks
Comments
I'd still like to get a better understanding of how to add sites like these and other custom sources myself. I have a lot of authority sites. Some have search functions like that, and some don't (I'll just add those as article sites directly via their homepage?).
Where would I add the %search% thing? Wherever the keyword shows up in the URL after I do a manual search for it? Or do I append it to the end no matter what? How does that work?
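Here is a small Python sketch of what I'm picturing for the %search% question: run a manual search, then swap the keyword in the resulting URL for the placeholder. Whether the tool really substitutes %search% this way is my assumption. `make_template` and `fill_template` are hypothetical helper names, not part of any real program.

```python
from urllib.parse import quote_plus

# Macro name as mentioned in this thread; how the tool expands it is assumed.
PLACEHOLDER = "%search%"

def make_template(searched_url, keyword):
    """Swap the keyword in a manually searched URL for the placeholder."""
    encoded = quote_plus(keyword)
    if encoded not in searched_url:
        raise ValueError("keyword not found in URL")
    return searched_url.replace(encoded, PLACEHOLDER)

def fill_template(template, keyword):
    """Put a URL-encoded keyword back where the placeholder sits."""
    return template.replace(PLACEHOLDER, quote_plus(keyword))

template = make_template(
    "https://search.cdc.gov/search/?query=diabetes&dpage=1", "diabetes"
)
url = fill_template(template, "type 2 diabetes")
print(template)
print(url)
```

So in this sketch the placeholder goes exactly where the keyword appeared in the query string, not at the end of the URL.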
It doesn't matter so much for sites that are very specific to one topic, but for ones that are mixed, I don't want to scrape every page on the site when most of them will be irrelevant.
And I'm still confused about the 'extract links up to level' function. Should I just set that to the max of 9999 for big sites that are specifically niche relevant?
And I'll adjust the clicks. Maybe a few dozen makes more sense? But I was trying to think less like a personal user who would click and read a few articles, and more like a data aggregator that wants to pull all of the relevant information out of the entire site.
WebMD is one of the biggest medical/health-related websites, if not the biggest. Maybe it's more of a USA-specific thing vs. Germany.
Is there any potential for CG to look for a feed or sitemap automatically from the root and pull from that, instead of me having to find them manually and add them? Different sites have different permutations of both.
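To show the kind of auto-discovery I mean, here is a rough Python sketch using two common conventions: `Sitemap:` lines in a site's robots.txt, and `<link rel="alternate">` feed tags in the homepage HTML. It only parses text you hand it (no downloading), and not every site follows either convention. All function names and the example.com data are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

def sitemaps_from_robots(robots_txt):
    """Return URLs listed on 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        name, _, value = line.partition(":")  # split at the first colon only
        if name.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found

class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate" type="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        feed_type = (a.get("type") or "").lower()
        if "alternate" in rel and feed_type in FEED_TYPES and a.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, a["href"]))

def feeds_from_html(html, base_url):
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds

# Made-up sample inputs standing in for a downloaded robots.txt and homepage.
robots = "User-agent: *\nDisallow: /private/\nSitemap: https://example.com/sitemap.xml\n"
page = ('<html><head><link rel="alternate" type="application/rss+xml" '
        'href="/feed.xml"></head><body></body></html>')

sitemaps = sitemaps_from_robots(robots)
feeds = feeds_from_html(page, "https://example.com/")
print(sitemaps)
print(feeds)
```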