How to add UTF8 characters in anchors in a project
data
first line = URL
> add URL and anchor
SER garbles anchors that have UTF8 foreign language characters
I have a multilingual site with anchor text in
ru, bg, pt, fr
and all those anchors are garbled after using Edit All, then Save
what is the solution in a global UTF8 world??
I tried URL encoding - BUT my Firefox did NOT find the URL when testing ...
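just to show what I mean by URL encoding the anchor, here is a quick Python sketch (urllib.parse.quote is only one way to produce it, not anything SER does):

```python
# Sketch only: percent-encode the anchor text the way a URL fragment would be encoded.
from urllib.parse import quote

anchor = "Дикша в Крийя Йогу"
print(quote(anchor))
# -> '%D0%94%D0%B8%D0%BA%D1%88%D0%B0%20%D0%B2...' (the UTF8 bytes, percent-encoded)
```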
Comments
SER just garbles it with ????? or similar
real UTF8 Cyrillic characters (for ru and bg), exactly as they exist in my Notepad file (UTF8 version) and are copied / pasted into SER, get replaced
for other languages such as pt and fr
only the characters with accents are garbled
here is exactly what I get:
if I take my URL# cyrillic anchor1,anchor2
and put it in Notepad with ANSI encoding > save and re-open the same file,
that is exactly how it looks when I ADD the URLs + anchors in SER
if I did the same in Notepad with UTF8 encoding = all would be perfect
the 2 URLs below are among the problem ones = 1 in ru, the other in pt
http://www.kriyayoga.com/kriya_yoga/initiation/diksha_ru.html#Дикша в Крийя Йогу,Крийя Йогу
http://www.kriyayoga.com/kriya_yoga/initiation/diksha_pt.html#Iniciação ao Kriya Yoga
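to make that reproducible, here is a small Python sketch of the Notepad round-trip, assuming the Windows ANSI codepage is cp1252 (on a Russian Windows it would be cp1251, which keeps Cyrillic but breaks every other script):

```python
# Assumption: "ANSI" here means the Windows Western codepage cp1252.
anchor = "Дикша в Крийя Йогу,Крийя Йогу"   # the anchors from the ru URL above

# UTF8 round-trip: every character survives
assert anchor.encode("utf-8").decode("utf-8") == anchor

# ANSI "round-trip": characters with no slot in the 256-value codepage
# get replaced, which is exactly the ????? garbling
print(anchor.encode("cp1252", errors="replace").decode("cp1252"))
# -> '????? ? ????? ????,????? ????'
```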
hope that helps to solve the problem
tried your method
1. import from clipboard - NOT working = again ru converted into ????
2. import from notepad-utf8 file = at first it appeared to work
BUT
when I reopened the complete list again = all was garbled again, this time NOT with ????? but with other weird symbols
it is the NON-UTF8 way SER saves things on HDD
in SER it appears there are quite a number of data files stored in ANSI instead of UTF8
maybe that is a Windows system thing
but in our international www it might be time to move to all systems being UTF8
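my guess at why this garbling looks different from the ????? case (a Python sketch, again assuming cp1252, which is NOT confirmed SER behaviour): this time the UTF8 bytes in the file seem to get read back with the wrong codepage instead of being replaced:

```python
# Guess: UTF8 bytes on disk decoded with an ANSI codepage (cp1252 assumed).
anchor = "Дикша"

utf8_bytes = anchor.encode("utf-8")    # what a UTF8 file actually contains
print(utf8_bytes.decode("cp1252"))     # read back as ANSI
# -> 'Ð”Ð¸ÐºÑˆÐ°'  (each 2-byte Cyrillic letter becomes two Western symbols)
```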
coming from the Linux world, incl. years of server admin:
for many years now ALL of it has been UTF8 by default
server HDD = file system language support
all editors
all config files
all data files
mysql
all desktop environment
etc
most global web servers default to UTF8,
but a few cheap old-fashioned US-based hosts still run ANSI systems and garble all international content = which means some outdated article directories (incl. articlesnatch) do garble foreign languages in UTF8
in the Microsoft world I have no idea
but a Google search for the phrase below shows some sources:
convert ansi to utf8
but the key is that you may do such a conversion ONLY ONCE on each file
if you convert an already-UTF8 file again with conversion tools, the result is a disaster and all data is garbled
hence there needs to be a test FIRST: if the file to convert is ANSI, convert it
else
skip the conversion
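for illustration, a rough Python sketch of that test-first, convert-once logic (assuming the ANSI files are cp1252; adjust the codepage for other locales, and real conversion tools should be more careful than this):

```python
import sys
from pathlib import Path

# Test-first, convert-once: only re-encode a file when it is NOT already
# valid UTF8. Assumption: "ANSI" means the Windows Western codepage cp1252.
def convert_ansi_to_utf8_once(path: str, ansi_codepage: str = "cp1252") -> None:
    raw = Path(path).read_bytes()
    try:
        raw.decode("utf-8")                    # already valid UTF8 (or plain ASCII)?
        print(f"{path}: already UTF8, skipping")
        return
    except UnicodeDecodeError:
        pass                                   # not UTF8, assume ANSI and convert
    text = raw.decode(ansi_codepage, errors="replace")
    Path(path).write_bytes(text.encode("utf-8"))
    print(f"{path}: converted {ansi_codepage} -> UTF8")

if __name__ == "__main__":
    for filename in sys.argv[1:]:              # pass file names on the command line
        convert_ansi_to_utf8_once(filename)
```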
You can try UTF8Cast, a free tool for mass charset conversion
I need free time to study macros first
but instead of making local fixes, I prefer permanent system fixes in the SW;
I think it is more a generic SER system problem that needs to be fixed to internationalize SER for our modern multilingual world
maybe @sven can fix that some day
your word in God's ears
I will do a test session tomorrow and see on the websites what happens there
but why does it have to be converted to ANSI at all?
in an age of internationalization everything is global and everything could be UTF8 by default, as professional servers have done for many years
I am pretty sure G and Y have everything in UTF8 in all their apps
when I import from file or import from clipboard and then open and look at the URLs
I have 2 totally different scrambled anchor texts
which file in SER stores those URLs, so I can see exactly what happens?
isn't there a real and final loss when UTF8 is reduced to ANSI?
when you take a rabbit
and cut OFF all its legs to make the rabbit smaller for storage / sleeping during the night,
then in the morning ... all legs GONE for good = rabbit dead
that is how I understand it, from a non-tech spiritual point of view ...
ANSI uses one single fixed byte (8 bits) to represent each character = a maximum of 256 characters.
vs
UTF-8 is a multi-byte encoding scheme, using 1 to 4 bytes (up to 6 in its original design) to represent a single character.
Hence when you store more than the one ANSI byte (8 bits), all the other bytes get LOST, just as the Microsoft warning says when saving a Notepad window containing UTF8 characters as an ANSI file.
Whereas all 1,112,064 Unicode code points - characters, control codes, etc. - can be fully represented within UTF-8's 1-4 bytes.
so please explain to me, in very short, how ANSI recovers the lost bytes
when converting a one-byte character back into a 1-to-4-byte one
maybe there is some unknown, newly created Microsoft magic??
could it be possible that YOU only get apparently correct output because some hidden system memory still has the original UTF8 stored??
I really want to be sure that nothing is getting lost, as you are the first in the world who says so
why else would all internationalized file systems and data systems be UTF8, which of course results in MUCH larger file sizes for international texts,
if the SAME could be done with 256 values only??
storing UTF8 as ANSI CUTS off and trashes anything beyond the 1st ONE byte
lost for all eternity
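to put rough numbers on the rabbit (a Python sketch, with cp1252 standing in for ANSI; my own illustration, not from any tool mentioned here):

```python
anchor = "Дикша в Крийя Йогу"

print(len(anchor))                     # 18 characters
print(len(anchor.encode("utf-8")))     # 33 bytes (each Cyrillic letter takes 2 bytes)

# There is simply no slot for 'Д' among the 256 ANSI values, so a strict
# conversion fails outright; a "lossy" one can only substitute '?' and
# the original character is gone for good.
try:
    "Д".encode("cp1252")
except UnicodeEncodeError as err:
    print(err)   # 'charmap' codec can't encode character '\u0414' ...
```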
in my understanding as a NON-coder
more than a decade ago we had to decide the code-page when configuring a server OR desktop to meet the language needs for a single language = by excluding all other languages
that time is gone when strictly using UTF8 all over a system
all we then need is the fonts for the languages used locally and the language support on servers
which on typical Linux servers always includes ALL global languages by default
maybe some clarification is needed, to have peace of mind before following your advice
MAYBE here you find a more detailed, more technical explanation, better suited to professional coders:
http://www.differencebetween.net/technology/protocols-formats/difference-between-ansi-and-utf-8/
I fully agree with the final Summary by the author of the above page
No need to teach me UTF8 or anything about it. The reason why I still use non-unicode in most applications is that it's faster. Imagine the log e.g. ... if you have to parse it for unicode it takes time; ANSI is post and forget.
I only add unicode support where I see a useful benefit from it. OK, "Edit All" might come in here, as it is indeed ugly, but nothing to write a "book" about. I will put it on the todo list though.
I see your point about file sizes
my first office PC in 1985 had a floppy with some 169 kB or so, an 8-bit CPU at 5 MHz (and all commercial SW in ultra-fast assembler), and the OS was the never-crashing CP/M
but with today's high-speed CPU / HDD / RAM
in the commercial IT world that appears to be no longer a criterion, as speed compensates by far for file size
processing time is won or lost in the programming language and above all in desktop environments and GUIs
how else could mega companies like Google etc. make UTF8 choices if overall performance were worse than ANSI
and in an international world the advantages of overall default UTF8 are much greater than the file size
see your other customer with his Greek character problems
and see all the missing potential customers from NON-ANSI countries
and
even Google appears to have problems with NON-UTF8 stuff
a test G search - a comparison of Google search results between the 2 footprints below
- inurl:"특수기능:로그인" wiki
- and same from NON-UTF8 world
- inurl:"%ED%8A%B9%EC%88%98%EA%B8%B0%EB%8A%A5:%EB%A1%9C%EA%B7%B8%EC%9D%B8" wiki
the second footprint is from SER; the first is from today's work converting URL-encoded footprints into their regular original characters, for my offline URL filtering used on SB-harvested target sites
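for reference, converting such a footprint back to its original characters is what I did offline today; a Python one-liner like urllib.parse.unquote does it (my own helper, nothing inside SER):

```python
# Decode a percent-encoded footprint back to its original UTF8 characters.
from urllib.parse import unquote

encoded = '%ED%8A%B9%EC%88%98%EA%B8%B0%EB%8A%A5:%EB%A1%9C%EA%B7%B8%EC%9D%B8'
print(unquote(encoded))
# -> '특수기능:로그인'
```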
whatever
have a nice day