
Content Generator: Scrape all Articles from a custom domain

Hey guys,
is it possible to scrape all articles from a domain? I tried the search function, but I can't find an answer that helps me.

What I'm trying to do:

I have a domain with 150 articles. I want to enter the domain.tld and have the Content Generator scrape all of these articles from all of its URLs.

Can anyone help me with that problem?

Thanks a lot!

Comments

  • SvenSven www.GSA-Online.de
    Just add it as a custom source.
  • I tried that, but I get the message: "Sorry, not enough content to create any article. Try to select more sources or use more keywords."
  • Isn't it possible to scrape without entering a keyword? I selected 129 sources for that, but it didn't extract the content from the pages.
  • SvenSven www.GSA-Online.de
    Can you paste some log lines from scraping your custom source site? Or paste a screenshot of how you configured it.
  • Hi @Sven

    I'm interested in this too.

    What exactly would be the process? When I create a GSC project, it asks me to enter at least one keyword.

    My process is as follows:

    I create a project
    I enter any keyword
    I add the source(s) and set them to extract
    In output I choose "same article" and in "number of words" I set it from 10 to 20000

    And the error it gives me is: "Sorry, not enough content to create any article."
  • It worked for me when I entered some keywords, like question words: "what can", "what is", "what about".

    Phrases for home improvement.
  • @przamunda

    "In output I choose "same article" and in "number of words" I set it from 10-20000 "

    I think 20000 is too much. Set it to 150-300 and test again.
  • SvenSven www.GSA-Online.de
    Algorithm: "Same Article"
    Number of Words: 1-100000 (to get all articles, long or short)
    Keyword: Use related keywords or at least some common words like "and", "or", "a"...
  • I've tried it, but it doesn't work very well. The workaround that is working for me is to scrape the sitemap of the page I want to use as a source and use the extracted pages as sources (a sketch of that is below).

    Maybe an idea would be to be able to add sitemaps as sources.
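    For anyone who wants to automate that workaround: here is a minimal Python sketch that pulls the page URLs out of a standard sitemap.xml so they can be pasted into Content Generator as individual custom sources. The sitemap URL is a placeholder and the script assumes the usual sitemaps.org XML namespace; it illustrates the workaround and is not a GSA feature.

        # Illustrative helper: extract <loc> URLs from a standard sitemap.xml
        # so each page can be added as its own custom source.
        import urllib.request
        import xml.etree.ElementTree as ET

        SITEMAP_URL = "https://domain.tld/sitemap.xml"  # placeholder URL

        def sitemap_urls(url):
            """Return all <loc> entries from a urlset or sitemapindex."""
            with urllib.request.urlopen(url) as resp:
                root = ET.fromstring(resp.read())
            ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
            return [loc.text.strip() for loc in root.findall(".//sm:loc", ns) if loc.text]

        if __name__ == "__main__":
            for page in sitemap_urls(SITEMAP_URL):
                print(page)  # paste these lines into the custom sources list

    For a sitemap index, the printed entries are themselves sitemap URLs, and the script can be run again on each of them.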
  • SvenSven www.GSA-Online.de
    Well, I don't know the structure of the site, but maybe it is hiding articles away behind too many sublink clicks?
  • @tanuki, thanks for your input :-)

    @Sven I DM'd them to you
  • Sorry for digging out this old thread, but I am trying to achieve the same at the moment.

    The idea to have the sitemap as a custom source would be awesome. GSA would then scrape the articles listed on the sitemap, not leave the site, and need no keywords. Just plain downloading of articles.
  • SvenSven www.GSA-Online.de
    That's already working. GSA Content Generator can use your sitemap or RSS feed URL as a source. It will go through that structure and extract links to parse them for articles.
  • Ok cool, will try that out. As keywords I would put a list of keywords (let's say in German) that are included in every article ever, and it would download them? (See the sketch after this list.)

    Like
    der
    die
    das
    und
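    As a small illustration of that trick, the sketch below writes such a list of very common German words to a text file that can then be imported as the project's keywords. The filename and the exact word list are made up for the example; GSA itself does not need this script.

        # Illustrative only: write a "match almost everything" German
        # stop-word list to a file for import as project keywords.
        COMMON_DE = ["der", "die", "das", "und", "ist", "ein", "in", "mit"]

        with open("keywords_de.txt", "w", encoding="utf-8") as f:
            f.write("\n".join(COMMON_DE))
        print("Wrote", len(COMMON_DE), "keywords to keywords_de.txt")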
  • SvenSven www.GSA-Online.de
    Yes, that would work. Though you would still get "articles" that might not be related at all.
  • Larus123Larus123 Germany
    edited October 2022
    Pushing this topic again. What would be the settings for the keyword and output options? The tool quite often scrapes thousands of articles; I can see that it counts the characters, but then no article is created. I used lists of keywords like a stop word list, numbers, and a domain name keyword list from scraping, but I only ended up with 0 or a small number of articles, even when 50k URLs are used in the custom source. I want a 100% copy of the text, as I will paraphrase and summarize it with another tool.
  • slqlsmslqlsm Viet Nam
    edited April 2023
    I'm interested in this too.
    How can I get all "articles" from a domain? I set up the custom source with that domain and the focus keyword, but nothing is scraped. See the log file:
    a.log 12.5K
  • SvenSven www.GSA-Online.de
    The log shows just the end. Please disable all other scrapers and only use the custom one; then there might be something useful in the log.
    Also show how you set up the custom source, please.
  • rastarrrastarr Thailand
    @slqlsm - To add to what @Sven has just said, you could also view the site's source code and search for 'rss'. Add the feed as a custom source and see if that works for you too. I recently did this on a few sites for additional content.
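    A quick way to do what rastarr describes programmatically: fetch the page and look for RSS/Atom <link rel="alternate"> tags in the HTML head. This is a rough sketch under the assumption that the site advertises its feed that way; the URL is a placeholder, and GSA only needs the resulting feed URL as a custom source.

        # Sketch: discover a site's RSS/Atom feed URL from its HTML head.
        import urllib.request
        from html.parser import HTMLParser

        class FeedFinder(HTMLParser):
            FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

            def __init__(self):
                super().__init__()
                self.feeds = []

            def handle_starttag(self, tag, attrs):
                if tag != "link":
                    return
                a = dict(attrs)
                if a.get("rel") == "alternate" and a.get("type") in self.FEED_TYPES:
                    self.feeds.append(a.get("href"))

        with urllib.request.urlopen("https://domain.tld/") as resp:  # placeholder
            finder = FeedFinder()
            finder.feed(resp.read().decode("utf-8", errors="replace"))

        for feed_url in finder.feeds:
            print(feed_url)  # add one of these as a custom source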
  • slqlsmslqlsm Viet Nam
    Sven said:
    The log just the end. Please disable all other scrapers and only use the custom one. Then there might be something useful in the log.
    Also show how you setup the custom source please.
    I unchecked all regular sources and used only my custom source; see the log and 3 pictures:
    a.log 1.4K
  • SvenSven www.GSA-Online.de
    Please edit the custom source again and uncheck "It's a search engine"
  • slqlsmslqlsm Viet Nam
    Still an error, please help me:
    a.log 644B
  • SvenSven www.GSA-Online.de
    I am able to scrape the whole site with that setting. Please check your proxy setup and maybe disable proxies to see if that helps.
  • slqlsmslqlsm Viet Nam
    Hello, thanks. The problem was with the GSA proxy.