Content Generator: Scrape all Articles from a custom domain

tanuki · October 2019

Hey guys,
Is it possible to scrape all articles from a domain? I tried the search function, but I can't find an answer that helps.

What I'm trying to do:

I have a domain with 150 articles. I want to enter the domain.tld and have the Content Generator scrape all of these articles from all of its URLs.

Can anyone help me with that problem?

Thanks a lot!

Sven · October 2019

Just add it as a custom source.

tanuki · October 2019

I tried that, but I get the message: Sorry, not enough content to create any article. Try to select more sources or use more keywords.

tanuki · October 2019

Isn't it possible to scrape without entering a keyword? I selected 129 sources for that, but it didn't extract the content from the pages.

Sven · October 2019

Can you paste some log lines from when it scrapes your custom source site? Or paste a screenshot of how you configured it.

przamunda · October 2019

Hi @Sven

I'm interested in this too.

What exactly would be the process? When I create a GSC project it asks me to enter at least one keyword.

My process is as follows:

I create a project

I enter any keyword

I add the source(s) and set them to extract

In output I choose "same article" and in "number of words" I set it from 10-20000

And the error it gives me is: "Sorry, not enough content to create any article."

tanuki · October 2019

I tried it by entering some keywords, such as question phrases: "what can", "what is", "what about".

And phrases for home improvement.

tanuki · October 2019

@przamunda

"In output I choose "same article" and in "number of words" I set it from 10-20000 "

I think 20000 is too much. Set it to 150-300 and test again.

Sven · October 2019

Algorithm: "Same Article"

Number of Words: 1-100000 (to get all, long or short articles)

Keyword: Use related keywords or at least some common words like "and", "or", "a"...

przamunda · October 2019

I've tried it, but it doesn't work very well. The workaround that is working for me is to scrape the sitemap of the site I want to use as a source and then use the extracted pages as sources.

Maybe an idea would be to be able to add sitemaps as sources.
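
For anyone who wants to do the sitemap step outside the tool, here is a minimal sketch in Python. It assumes the site publishes a standard XML sitemap and that you already have its text (downloaded in a browser or otherwise). The function name extract_sitemap_urls and the example.com URLs are only illustrative, and this is not how GSA Content Generator handles sitemaps internally.

```python
import xml.etree.ElementTree as ET


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap document.

    Hypothetical helper for illustration only. It also works on a
    sitemap index, where the <loc> entries point to further sitemaps.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Namespaced tags look like "{namespace}loc", so compare only
        # the local part of the tag name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


if __name__ == "__main__":
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/article-1</loc></url>"
        "<url><loc>https://example.com/article-2</loc></url>"
        "</urlset>"
    )
    # One URL per line, ready to paste in as custom sources.
    print("\n".join(extract_sitemap_urls(sample)))
```

Note that extension sitemaps (for example image sitemaps) also use loc tags, so you may want to filter the resulting list before adding the URLs as sources.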

Sven · October 2019

Well, I don't know the structure of the site, but maybe the articles are too many sublink clicks away?

przamunda · October 2019

@tanuki, thanks for your input :-)

@Sven I'll DM them to you

henningnet · May 2020

Sorry for digging out this old thread, but I am trying to achieve the same at the moment.

Having the sitemap as a custom source would be awesome: GSA would scrape only the articles listed in the sitemap, would not leave the site, and would need no keywords. Just plain downloading of articles.

Sven · May 2020

That's already working. GSA Content Generator can use your sitemap or RSS feed URL as a source. It goes through that structure, extracts the links, and parses them for articles.

henningnet · May 2020

OK cool, will try that out. As keywords, I would enter a list of words (let's say in German) that appear in practically every article, and it would download them?

Like
der
die
das
und

Sven · May 2020

Yes, that would work, though you would still get "articles" that might not be related at all.

Larus123 · October 2022

Pushing this topic again. What would be the right settings for keywords and output? The tool quite often scrapes thousands of articles (I can see that it counts the characters), but then no article is created. I have used keyword lists such as a stop word list, numbers, the domain name, and a keyword list from the scrape, but I only end up with 0 or a small number of articles, even when 50k URLs are used in the custom source. I want a 100% copy of the text, as I will paraphrase and summarize it with another tool.

slqlsm · April 2023

I'm interested in this too.

How can I get all "articles" from a domain? I set up a custom source ("that domain") and the focus keyword, but nothing is scraped. See the log file.

Sven · April 2023

The log shows just the end. Please disable all other scrapers and only use the custom one. Then there might be something useful in the log.

Also, please show how you set up the custom source.

rastarr · April 2023

@slqlsm - To add to what @Sven has just said, you could also view the site's source code and search for 'rss'. Add the feed as a custom source and see if that works for you too. I recently did this on a few sites for additional content.
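
If you would rather not search the page source by hand, the same check can be sketched in Python. This assumes the site advertises its feed with a standard link tag of type application/rss+xml or application/atom+xml in the page HTML; not every site does. The names FeedLinkFinder and find_feed_urls are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects feed URLs from <link> tags (illustrative helper)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.feeds: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None for bare attributes.
        attributes = {name: (value or "") for name, value in attrs}
        href = attributes.get("href", "")
        if attributes.get("type", "").lower() in FEED_TYPES and href:
            # Resolve relative links such as "/feed/" against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def find_feed_urls(html_text: str, base_url: str) -> list[str]:
    finder = FeedLinkFinder(base_url)
    finder.feed(html_text)
    return finder.feeds


if __name__ == "__main__":
    page = (
        "<html><head>"
        '<link rel="alternate" type="application/rss+xml" href="/feed/">'
        "</head><body></body></html>"
    )
    print(find_feed_urls(page, "https://example.com/"))
```

Any URL it finds is a candidate to add as a custom source, as described above.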

slqlsm · April 2023

Sven said:

The log shows just the end. Please disable all other scrapers and only use the custom one. Then there might be something useful in the log.
Also, please show how you set up the custom source.

I have unchecked all regular sources and use only my custom source. See the log and 3 pictures.

Image: https://forum.gsa-online.de/uploads/editor/a9/pxdgs64auvqf.png

Sven · April 2023

Please edit the custom source again and uncheck "It's a search engine"

slqlsm · April 2023

Image: https://forum.gsa-online.de/uploads/editor/84/9gz9iv45oo14.jpg

Still an error, please help me.

Sven · April 2023

I am able to scrape the whole site with that setting. Please check your proxy setup and maybe disable proxies to see if that helps.

slqlsm · April 2023

Hello, thanks. The problem came from the GSA proxy.
