Filter out non-English characters from collected keywords
varthdaver
Sydney, Australia
Somehow it seems I'm gathering a lot of non-English keywords.
I know I can stop collecting keywords from target sites and using them to find new target sites, but I'd like to keep doing this to ensure a broad keyword list.
But the non-English keywords only slow down the process of finding new sites, because I'll block whatever it finds as soon as it discovers that the page isn't in English.
Would it be possible to filter the language of the keywords, as well as the target pages? Or at least exclude any non-English / high-Unicode characters from the anchor texts?
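Something like this is the kind of filter I have in mind (a rough Python sketch; the keyword list is made up for illustration):

```python
# Drop any keyword/anchor text that contains characters outside the
# basic ASCII range. Keyword list below is just example data.

def is_plain_english(text: str) -> bool:
    """Return True if every character falls in the ASCII range (< 128)."""
    return all(ord(ch) < 128 for ch in text)

keywords = ["best running shoes", "günstige schuhe", "跑步鞋", "marathon training plan"]
english_only = [kw for kw in keywords if is_plain_english(kw)]
print(english_only)  # ['best running shoes', 'marathon training plan']
```

A simple ASCII check like this would also drop English keywords that happen to contain accented characters or curly quotes, so allowing a few extra Latin characters might be needed.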
Comments
But if you look through the Unicode character ranges, there are codepages (blocks) that are used solely for specific languages: the Chinese, Arabic, and Indic scripts, for example. See http://www.unicode.org/charts/ and https://en.wikipedia.org/wiki/Code_page
So it would still be very helpful to be able to restrict the anchor texts by codepage.
I figure the code you'll need is in here: http://www.codeproject.com/Articles/17201/Detect-Encoding-for-In-and-Outgoing-Text
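To illustrate what I mean by restricting by codepage, here's a rough Python sketch; the block ranges are a small subset copied from the Unicode charts, and the helper name and anchor list are just made up:

```python
# Reject anchor texts containing characters from a blacklist of
# Unicode blocks. Ranges taken from http://www.unicode.org/charts/
BLOCKED_RANGES = [
    (0x0600, 0x06FF),   # Arabic
    (0x0900, 0x097F),   # Devanagari
    (0x3040, 0x30FF),   # Hiragana and Katakana
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs
]

def in_blocked_script(text: str) -> bool:
    """Return True if any character falls inside a blocked Unicode range."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in BLOCKED_RANGES)

anchors = ["cheap flights", "رحلات رخيصة", "格安航空券", "last-minute deals"]
kept = [a for a in anchors if not in_blocked_script(a)]
print(kept)  # ['cheap flights', 'last-minute deals']
```

Blacklisting a handful of script blocks like this would catch most of the anchors I'm seeing, without needing full language detection of the pages themselves.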
Thanks for considering this.