Can GSA ignore the language of a source?

Hey guys, hey Sven,

I tried to download about 450 articles (custom sources via URL) from a single German-language site using the "Same Article" function.

In the page's source code the language is declared as:
<html class="no-js" lang="en-US"> <!--<![endif]-->

Of course, all the articles are in German. So GSA won't download any of them when it's set to German, because it reports an unwanted language in the source.

Changing the setting in GSA to "English" then gives me "unwanted unicode language DE/SI etc", so again it does not download the articles.

Is there any way to tell GSA to ignore the language? It would also make sense if, say, I wanted to scrape a multilanguage website.

  • Sven
    The next update will allow you to ignore language checks on custom sources.
  • At the moment, downloading a full article via "Same Article" leads GSA either to download only part of the article (a paragraph or two) or to report "unrelevant content". As the keyword I entered "der", which of course appears many times in every article.

    Any idea what's happening here? The filter is empty, the number of articles is set to Max, the number of words is set to 1-30000, and everything else is unchecked.
  • Sven
    Send me the project backup with the sources that cause problems.
  • Also, it looks like when I give GSA about 50 sources, the software only downloads 12, 15, or sometimes 19. The log does not show any errors or any reason why GSA is not processing the other custom sources.
  • Can we have this "Ignore Language checks" flag for normal campaigns as well, as an advanced option? Thank you.
  • Sven
    I don't think this is a good idea. You often get a lot of search engine results that are not in the wanted language, and you end up with crap if that check is not performed to filter them out.
  • OK, could there at least be an option to accept sites that declare en-US by default? I have the same problem the OP described, but for a normal search with foreign keywords: I see a bunch of good sites left behind because of the "unwanted language" check.
    I'm willing to try this as a beta and check the results I get, if you want.
  • Sven
    Can you first send me some sample URLs where the detection failed? When "en" is declared inside the source, the language is already detected in a different way as well, because many sites just have that in as a default by mistake.
  • I can confirm that many French-language sites have the html tag set to lang="en-us".
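
For anyone who wants to check their own sample URLs before sending them in, here is a rough Python sketch of the idea discussed above: read the declared lang attribute, but don't trust a declared "en" when the text itself clearly looks like another language. This is not how GSA does it internally (that isn't documented in this thread). The function names, the tiny stopword lists, and the rule "only distrust en" are all assumptions made up for illustration.

```python
from html.parser import HTMLParser
import re

# Tiny illustrative stopword lists (an assumption, not GSA's data);
# a real detector would use far more words or character statistics.
STOPWORDS = {
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit"},
    "fr": {"le", "la", "les", "et", "est", "une", "des", "pas"},
    "en": {"the", "and", "is", "of", "to", "not", "with", "a"},
}


class LangAttrParser(HTMLParser):
    """Records the lang attribute of the first <html> tag it sees."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def declared_language(html):
    """Return the primary language subtag from <html lang="...">, or None."""
    parser = LangAttrParser()
    parser.feed(html)
    if not parser.lang:
        return None
    return parser.lang.split("-")[0].lower()


def guess_language(text):
    """Guess the language by counting stopword hits; None if nothing matches."""
    words = re.findall(r"[a-zäöüßàâçéèêëîïôûù]+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def effective_language(html, text):
    """Prefer the content guess when the page declares "en" (often a default)."""
    declared = declared_language(html)
    guessed = guess_language(text)
    if declared == "en" and guessed is not None:
        return guessed
    return declared or guessed


page = '<html class="no-js" lang="en-US"> <!--<![endif]-->'
body = "Der Hund ist nicht mit der Katze und das ist ein Problem"
print(declared_language(page), effective_language(page, body))
```

With the header from the first post and a German sentence as the body, the declared language comes out as en while the effective language comes out as de. Any other declared language is trusted as-is here, which roughly mirrors the point that "en" is the value that tends to be left in by mistake.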
