How to add UTF8 characters in anchors in a project
data
first line = URL
> add URL and anchor
SER garbles anchors that have UTF8 foreign language characters
I have a multilingual site with anchor text in
ru, bg, pt, fr
and all those anchors are garbled after using Edit All, then Save
what is the solution in a global UTF8 world??
I tried URL encoding - BUT my Firefox did NOT find the URL when testing ...
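just to show what I mean by URL encoding the anchor, here is a quick Python sketch (urllib.parse.quote is only one way to produce it, not anything SER does):

```python
# Sketch only: percent-encode the anchor text the way a URL fragment would be encoded.
from urllib.parse import quote

anchor = "Дикша в Крийя Йогу"
print(quote(anchor))
# -> '%D0%94%D0%B8%D0%BA%D1%88%D0%B0%20%D0%B2...' (the UTF8 bytes, percent-encoded)
```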
Comments
SER just garbles it with ????? or similar
real UTF8 Cyrillic characters (for ru and bg), exactly as they exist in my Notepad file (UTF8 version) and are copied / pasted into SER, get replaced
for other languages such as pt and fr
only the characters with accents are garbled
here is exactly what I get:
if I take my URL# cyrillic anchor1,anchor2
and put it in Notepad with ANSI encoding > save and re-open the same file,
that is exactly how it looks when I ADD the URLs + anchors in SER
if I did the same in Notepad with UTF8 encoding = all would be perfect
the 2 URLs below are among the problem ones = 1 in ru, the other in pt
http://www.kriyayoga.com/kriya_yoga/initiation/diksha_ru.html#Дикша в Крийя Йогу,Крийя Йогу
http://www.kriyayoga.com/kriya_yoga/initiation/diksha_pt.html#Iniciação ao Kriya Yoga
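to make that reproducible, here is a small Python sketch of the Notepad round-trip, assuming the Windows ANSI codepage is cp1252 (on a Russian Windows it would be cp1251, which keeps Cyrillic but breaks every other script):

```python
# Assumption: "ANSI" here means the Windows Western codepage cp1252.
anchor = "Дикша в Крийя Йогу,Крийя Йогу"   # the anchors from the ru URL above

# UTF8 round-trip: every character survives
assert anchor.encode("utf-8").decode("utf-8") == anchor

# ANSI "round-trip": characters with no slot in the 256-value codepage
# get replaced, which is exactly the ????? garbling
print(anchor.encode("cp1252", errors="replace").decode("cp1252"))
# -> '????? ? ????? ????,????? ????'
```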
hope that helps to solve the problem
tried your method
1. import from clipboard - NOT working = again ru converted into ????
2. import from notepad-utf8 file = at first it appeared to work
BUT
when I reopened the complete list again = all was garbled again, this time NOT with ????? but with other weird symbols
it is the NON-UTF8 way SER saves things on HDD
in SER it appears there are quite a number of data files stored in ANSI instead of UTF8
maybe that is a Windows system thing
but in our international www it might be time to move to all systems being UTF8
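my guess at why this garbling looks different from the ????? case (a Python sketch, again assuming cp1252, which is NOT confirmed SER behaviour): this time the UTF8 bytes in the file seem to get read back with the wrong codepage instead of being replaced:

```python
# Guess: UTF8 bytes on disk decoded with an ANSI codepage (cp1252 assumed).
anchor = "Дикша"

utf8_bytes = anchor.encode("utf-8")    # what a UTF8 file actually contains
print(utf8_bytes.decode("cp1252"))     # read back as ANSI
# -> 'Ð”Ð¸ÐºÑˆÐ°'  (each 2-byte Cyrillic letter becomes two Western symbols)
```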
coming from the Linux world, incl. years of server admin:
for many years now ALL of it has been UTF8 by default
server HDD = file system language support
all editors
all config files
all data files
mysql
all desktop environment
etc
most global web servers default to UTF8,
but a few cheap old-fashioned US-based hosts still run ANSI systems and garble all international content = which means some outdated article directories (incl. articlesnatch) do garble foreign languages in UTF8
in the Microsoft world I have no idea
but a Google search for the phrase below shows some sources:
convert ansi to utf8
but the key is that you may do such a conversion ONLY ONCE on each file
if you convert an already-UTF8 file again with conversion tools, the result is a disaster and all data is garbled
hence there needs to be a test FIRST: if the file to convert is ANSI, convert it
else
skip the conversion
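for illustration, a rough Python sketch of that test-first, convert-once logic (assuming the ANSI files are cp1252; adjust the codepage for other locales, and real conversion tools should be more careful than this):

```python
import sys
from pathlib import Path

# Test-first, convert-once: only re-encode a file when it is NOT already
# valid UTF8. Assumption: "ANSI" means the Windows Western codepage cp1252.
def convert_ansi_to_utf8_once(path: str, ansi_codepage: str = "cp1252") -> None:
    raw = Path(path).read_bytes()
    try:
        raw.decode("utf-8")                    # already valid UTF8 (or plain ASCII)?
        print(f"{path}: already UTF8, skipping")
        return
    except UnicodeDecodeError:
        pass                                   # not UTF8, assume ANSI and convert
    text = raw.decode(ansi_codepage, errors="replace")
    Path(path).write_bytes(text.encode("utf-8"))
    print(f"{path}: converted {ansi_codepage} -> UTF8")

if __name__ == "__main__":
    for filename in sys.argv[1:]:              # pass file names on the command line
        convert_ansi_to_utf8_once(filename)
```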
You can try UTF8Cast, a free tool for mass charset conversion
I need free time to study macros first
but instead of making local fixes, I prefer permanent system fixes in the SW;
I think it is more a generic SER system problem that needs to be fixed to internationalize SER for our modern multilingual world
maybe @sven can fix that some day
your word in God's ears
I will do a test session tomorrow and see on the websites what happens there
but why does it have to be converted to ANSI at all?
in an age of internationalization everything is global and everything could be UTF8 by default, as professional servers have done for many years
I am pretty sure G and Y have everything in UTF8 in all their apps
when I import from file or import from clipboard and then open and look at the URLs
I have 2 totally different scrambled anchor texts
which file in SER stores those URLs, so I can see exactly what happens?
isn't there a real and final loss when UTF8 is reduced to ANSI?
when you take a rabbit
and cut OFF all its legs to make the rabbit smaller for storage / sleeping during the night,
then in the morning ... all legs GONE for good = rabbit dead
that is how I understand it, from a non-tech spiritual point of view ...
ANSI uses one single fixed byte (8 bits) to represent each character = a maximum of 256 characters.
vs
UTF-8 is a multi-byte encoding scheme, using 1 to 4 bytes (up to 6 in its original design) to represent a single character.
Hence when you store more than the one ANSI byte (8 bits), all the other bytes get LOST, just as the Microsoft warning says when saving a Notepad window containing UTF8 characters as an ANSI file.
Whereas all 1,112,064 Unicode code points - characters, control codes, etc. - can be fully represented within UTF-8's 1-4 bytes.
so please explain to me, in very short, how ANSI recovers the lost bytes
when converting a one-byte character back into a 1-to-4-byte one
maybe there is some unknown, newly created Microsoft magic??
could it be possible that YOU only get apparently correct output because some hidden system memory still has the original UTF8 stored??
I really want to be sure that nothing is getting lost, as you are the first in the world who says so
why else would all internationalized file systems and data systems be UTF8, which of course results in MUCH larger file sizes for international texts,
if the SAME could be done with 256 values only??
storing UTF8 as ANSI CUTS off and trashes anything beyond the 1st ONE byte
lost for all eternity
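to put rough numbers on the rabbit (a Python sketch, with cp1252 standing in for ANSI; my own illustration, not from any tool mentioned here):

```python
anchor = "Дикша в Крийя Йогу"

print(len(anchor))                     # 18 characters
print(len(anchor.encode("utf-8")))     # 33 bytes (each Cyrillic letter takes 2 bytes)

# There is simply no slot for 'Д' among the 256 ANSI values, so a strict
# conversion fails outright; a "lossy" one can only substitute '?' and
# the original character is gone for good.
try:
    "Д".encode("cp1252")
except UnicodeEncodeError as err:
    print(err)   # 'charmap' codec can't encode character '\u0414' ...
```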
in my understanding as a NON-coder
more than a decade ago we had to decide the code-page when configuring a server OR desktop to meet the language needs for a single language = by excluding all other languages
that time is gone when strictly using UTF8 all over a system
all we then need is the fonts for the languages used locally and the language support on servers
which on typical Linux servers always includes ALL global languages by default
maybe some clarification is needed, to have peace of mind before following your advice
MAYBE here you find a more detailed, more technical explanation, better suited to professional coders:
http://www.differencebetween.net/technology/protocols-formats/difference-between-ansi-and-utf-8/
I fully agree with the final Summary by the author of the above page
No need to teach me UTF8 or anything about it. The reason why I still use non-unicode in most applications is that it's faster. Imagine the log e.g. ... if you have to parse it for unicode it takes time; ANSI is post and forget.
I only add unicode support where I see a useful benefit from it. OK, "Edit All" might come in here, as it is indeed ugly, but nothing to write a "book" about. I will put it on the todo list though.
I see your point about file sizes
my first office PC in 1985 had a floppy with some 169 kB or so, an 8-bit CPU at 5 MHz (and all commercial SW in ultra-fast assembler), and the OS was the never-crashing CP/M
but with today's high-speed CPU / HDD / RAM
in the commercial IT world that appears to be no longer a criterion, as speed compensates by far for file size
processing time is won or lost in the programming language and above all in desktop environments and GUIs
how else could mega companies like Google etc. make UTF8 choices if overall performance were worse than ANSI
and in an international world the advantages of overall default UTF8 are much greater than the file size
see your other customer with his Greek character problems
and see all the missing potential customers from NON-ANSI countries
and
even Google appears to have problems with NON-UTF8 stuff
a test G search - a comparison of Google search results between the 2 footprints below
- inurl:"특수기능:로그인" wiki
- and same from NON-UTF8 world
- inurl:"%ED%8A%B9%EC%88%98%EA%B8%B0%EB%8A%A5:%EB%A1%9C%EA%B7%B8%EC%9D%B8" wiki
the second footprint is from SER; the first is from today's work converting URL-encoded footprints into their regular original characters, for my offline URL filtering used on SB-harvested target sites
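for reference, converting such a footprint back to its original characters is what I did offline today; a Python one-liner like urllib.parse.unquote does it (my own helper, nothing inside SER):

```python
# Decode a percent-encoded footprint back to its original UTF8 characters.
from urllib.parse import unquote

encoded = '%ED%8A%B9%EC%88%98%EA%B8%B0%EB%8A%A5:%EB%A1%9C%EA%B7%B8%EC%9D%B8'
print(unquote(encoded))
# -> '특수기능:로그인'
```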
whatever
have a nice day