Archive.org article scraper

I know the Content Generator can already go to expireddomains.net, look up domains from the keywords provided, and then attempt to scrape them from archive.org.

But say I already have a list of expired domains: can I feed this list into GSA Content Generator and have it search through archive.org?
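For anyone who wants to try this outside of Content Generator, here is a minimal Python sketch of the idea: read a one-per-line domain list and build lookup URLs for the Wayback Machine availability endpoint. The helper names (`parse_domains`, `availability_url`, `closest_snapshot`) are made up for illustration and have nothing to do with how GSA Content Generator works internally. The response layout assumed in `closest_snapshot` is the `archived_snapshots` / `closest` JSON structure that endpoint returns.

```python
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint.
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def parse_domains(text):
    """Return unique, lowercased domains from a one-per-line list."""
    domains = []
    for line in text.splitlines():
        domain = line.strip().lower()
        if not domain or domain.startswith("#"):
            continue  # skip blank lines and comment lines
        # Drop a leading scheme and any path, keeping only the host part.
        for prefix in ("http://", "https://"):
            if domain.startswith(prefix):
                domain = domain[len(prefix):]
        domain = domain.split("/", 1)[0]
        if domain and domain not in domains:
            domains.append(domain)
    return domains


def availability_url(domain, timestamp=None):
    """Build the lookup URL for one domain (timestamp like '20171030')."""
    params = {"url": domain}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_AVAILABILITY + "?" + urlencode(params)


def closest_snapshot(payload):
    """Pull the snapshot URL out of a decoded JSON response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

Each URL from `availability_url` could then be fetched (for example with `urllib.request.urlopen`), decoded with `json.loads`, and passed to `closest_snapshot` to get a snapshot address to scrape. Re-reading the text file on every pass would cover the "dynamic list" case.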

Comments

  • Sven www.GSA-Online.de
    No, that's not possible right now.
  • Would this be hard to add as a new feature? It could even run in parallel with the expired domain crawlers if it could read from a dynamic txt file, etc.
  • I am also interested in this feature.
  • I have a problem with expireddomains.net. When I have a German-language project with German keywords, I get this error message for all articles:

    Unwanted language "en" detected for http://web.archive.org/web/20171030052420...
  • Sven www.GSA-Online.de
    You also wrote me an email. The lang="en" is part of the websites in the log you sent. Some have lang="de", and if no detection is possible, CG falls back to detecting the language by domain/IP, which is of course a problem on this site.
  • A very good topic to revive!

    Indeed, is it possible to add such a function so that you can import a list of domains and get the output articles in text files? Can I still use this program to find content through the web archive?
  • Sven www.GSA-Online.de
    Yes, it should still be possible without problems.
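
The language check discussed in the thread (use the page's lang attribute, fall back to another signal when it is missing) can be illustrated with a rough Python sketch. This is not CG's actual code. The `TLD_LANG` table and the choice to fall back on the original domain's TLD are assumptions made for the example. Using the original domain avoids the problem mentioned above, since every archived page is served from the web.archive.org host.

```python
from html.parser import HTMLParser

# Hypothetical fallback table: top-level domain -> language code.
TLD_LANG = {"de": "de", "at": "de", "fr": "fr"}


class LangAttributeParser(HTMLParser):
    """Records the lang attribute of the first <html> tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    # "en-US" -> "en"
                    self.lang = value.strip().lower().split("-")[0]


def detect_language(html, original_domain):
    """Return a language code from the markup, else guess from the TLD."""
    parser = LangAttributeParser()
    parser.feed(html)
    if parser.lang:
        return parser.lang
    tld = original_domain.rsplit(".", 1)[-1].lower()
    return TLD_LANG.get(tld)  # None when the TLD is not in the table
```

With this logic, a snapshot of a .de domain whose markup declares lang="en" would still be reported as English, which matches the "Unwanted language" message quoted earlier; only pages without a lang attribute reach the TLD fallback.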