Filter out non-English characters from collected keywords
varthdaver
Sydney, Australia
Somehow it seems I'm gathering a lot of non-English keywords.
I know I can stop collecting keywords from target sites and using them to find new target sites, but I'd like to keep doing this to ensure a broad keyword list.
But the non-English keywords only slow down the process of finding new sites, because I'll block whatever it finds as soon as it discovers that the page isn't in English.
Would it be possible to filter the language of the keywords, as well as the target pages? Or at least exclude any non-English / high-Unicode characters from the anchor texts?
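Something like this is the kind of filter I have in mind (a rough Python sketch; the keyword list is made up for illustration):

```python
# Drop any keyword/anchor text that contains characters outside the
# basic ASCII range. Keyword list below is just example data.

def is_plain_english(text: str) -> bool:
    """Return True if every character falls in the ASCII range (< 128)."""
    return all(ord(ch) < 128 for ch in text)

keywords = ["best running shoes", "günstige schuhe", "跑步鞋", "marathon training plan"]
english_only = [kw for kw in keywords if is_plain_english(kw)]
print(english_only)  # ['best running shoes', 'marathon training plan']
```

A simple ASCII check like this would also drop English keywords that happen to contain accented characters or curly quotes, so allowing a few extra Latin characters might be needed.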
Comments
But if you look through the Unicode character ranges, there are codepages (blocks) that are used solely for specific languages: the Chinese, Arabic, and Indic scripts, for example. See http://www.unicode.org/charts/ and https://en.wikipedia.org/wiki/Code_page
So it would still be very helpful to be able to restrict the anchor texts by codepage.
I figure the code you'll need is in here: http://www.codeproject.com/Articles/17201/Detect-Encoding-for-In-and-Outgoing-Text
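To illustrate what I mean by restricting by codepage, here's a rough Python sketch; the block ranges are a small subset copied from the Unicode charts, and the helper name and anchor list are just made up:

```python
# Reject anchor texts containing characters from a blacklist of
# Unicode blocks. Ranges taken from http://www.unicode.org/charts/
BLOCKED_RANGES = [
    (0x0600, 0x06FF),   # Arabic
    (0x0900, 0x097F),   # Devanagari
    (0x3040, 0x30FF),   # Hiragana and Katakana
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs
]

def in_blocked_script(text: str) -> bool:
    """Return True if any character falls inside a blocked Unicode range."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in BLOCKED_RANGES)

anchors = ["cheap flights", "رحلات رخيصة", "格安航空券", "last-minute deals"]
kept = [a for a in anchors if not in_blocked_script(a)]
print(kept)  # ['cheap flights', 'last-minute deals']
```

Blacklisting a handful of script blocks like this would catch most of the anchors I'm seeing, without needing full language detection of the pages themselves.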
Thanks for considering this.