how to add UTF8 characters in anchors

in a project:
Data > Add URL and anchor
first line = URL

SER garbles anchors that contain UTF-8 foreign-language characters.
I have a multilingual site, including anchor text in
ru, bg, pt, and fr,
and all those anchors are garbled after using Edit All, then Save.

What is the solution in a global UTF-8 world??

I tried URL encoding - BUT my Firefox did NOT find the URL when testing ...
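For reference, "URL encoding" here means percent-encoding the UTF-8 bytes of the anchor text. A minimal Python sketch (the Cyrillic sample matches the anchors posted further down in the thread):

```python
from urllib.parse import quote, unquote

anchor = "Дикша"  # Cyrillic anchor text

encoded = quote(anchor)            # percent-encodes the UTF-8 bytes
print(encoded)                     # -> %D0%94%D0%B8%D0%BA%D1%88%D0%B0
print(unquote(encoded) == anchor)  # -> True: the encoding is reversible
```

The round trip is lossless, so if a percent-encoded URL does not resolve, the problem is on the server side (or the anchor part was never meant to be encoded), not in the encoding itself.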

Comments

  • SvenSven www.GSA-Online.de
    Hmm, just the "Edit All" popup menu cuts off the UTF-8 stuff, right?
  • NOT cut off:
    just garbled, with ????? or similar.
    Real UTF-8 Cyrillic characters (for ru and bg), exactly as they exist in my Notepad (UTF-8 version) and are copied/pasted into SER, are replaced.

    In other languages, such as pt and fr,
    only the accented characters are garbled.
  • @sven

    here is exactly what I get:
    if I take my "URL#cyrillic anchor1,anchor2",
    put it in Notepad with ANSI encoding, save, and re-open the same file,
    that is exactly how it looks when I add the URLs+anchors in SER.

    If I do the same in Notepad with UTF-8 encoding, all is perfect.
    The 2 URLs below are among the problems: one in ru, the other in pt.

    http://www.kriyayoga.com/kriya_yoga/initiation/diksha_ru.html#Дикша в Крийя Йогу,Крийя Йогу
    http://www.kriyayoga.com/kriya_yoga/initiation/diksha_pt.html#Iniciação ao Kriya Yoga

    hope that helps to solve the problem

  • Try saving as UTF-8 from Notepad and importing the text file into your project.
  • agmicmastermind
    tried your method:
    1. import from clipboard - NOT working: again, ru is converted into ????
    2. import from a Notepad UTF-8 file - at first it appeared to work,
    BUT
    when I reopened the complete list again, all was garbled again, this time NOT with ????? but with other weird symbols.

    It is the non-UTF-8 way SER saves things on the HDD.

    In SER it appears quite a number of data files are stored in ANSI instead of UTF-8.
    Maybe that is a Windows system thing,
    but in our international WWW it might be time to move the whole system to UTF-8.

    Coming from the Linux world, incl. years of server admin:
    for many years now, ALL has been UTF-8 by default, always:
    server HDD = file system language support,
    all editors,
    all config files,
    all data files,
    MySQL,
    the whole desktop environment,
    etc.

    Most global web servers default to UTF-8,
    but a few cheap, old-fashioned US-based hosts, still on outdated US-based systems, go ANSI and garble all international stuff, which means some old-fashioned, outdated article directories (incl. ArticleSnatch) do garble foreign-language UTF-8 text.
  • edited September 2013
    in Linux we have built-in system tools to do any charset conversion

    in the Microsoft world I have no idea,
    but a Google search for the phrase below shows some sources:

    convert ansi to utf8

    but the key is that you may do such a conversion ONLY ONCE on each file:
    if you run an already-UTF-8 file through a conversion tool again, the result is a disaster and all data is garbled

    hence there needs to be a test FIRST: if the file to convert is ANSI,
    convert it;
    else
    skip the conversion
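    The detect-then-convert rule above can be sketched in a few lines of Python. This is a minimal sketch under two assumptions: the legacy "ANSI" code page is taken to be Windows-1252 (it would be cp1251 for Russian text), and the UTF-8 validity check is a heuristic, since plain ASCII and, rarely, some legacy byte runs also pass it:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Heuristic: True if the bytes already decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def convert_once(path: str, legacy: str = "cp1252") -> None:
    """Convert a legacy-encoded file to UTF-8, but ONLY if it is not
    already UTF-8; converting a second time would garble the data."""
    with open(path, "rb") as f:
        data = f.read()
    if is_valid_utf8(data):
        return  # already UTF-8 (or plain ASCII): skip conversion
    with open(path, "wb") as f:
        f.write(data.decode(legacy).encode("utf-8"))
```

    Running `convert_once` twice on the same file is therefore safe: the second pass sees valid UTF-8 and does nothing.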
  • You might want to use the file via the file macro, so SER only reads it and never writes it. I am doing it this way with umlauts etc. and it works fine.
    You can also try UTF8Cast, a free tool for mass charset conversion.
  • will try some day;
    I need free time to study macros first

    but instead of making local fixes, I prefer permanent system fixes in the SW.
    I think it is more a generic SER system problem that might need to be fixed, to internationalize SER for our modern multilingual world.

    maybe @sven can fix that some day
  • SvenSven www.GSA-Online.de
    OK, so the problem for you is "Edit All"? Well, if you edit it, it is converted from Unicode into ANSI and later back to Unicode. Yes, sure, it looks ugly if you edit it like that, but it is not destroying anything for me (even with your sample URLs).
  • edited September 2013
    @sven

    your word in God's ears

    I will run a test session tomorrow and see what happens on the websites.

    But why does it have to be converted to ANSI at all?
    In an age of internationalization, everything is global and everything could be UTF-8 by default, as professional servers have done for many years.
    I am pretty sure G and Y have everything in UTF-8 in all their apps.

    When I import from file or import from clipboard and then open and look at the URLs,
    I get 2 totally different scrambled anchor texts.

    Which file stores those URLs in SER, so I can see exactly what happens?
    Isn't there a real and final loss when UTF-8 is reduced to ANSI?

    When you take a rabbit
    and cut OFF all its legs to make the rabbit smaller for storage/sleeping during the night,
    in the morning ... all legs are GONE for good = rabbit dead.


    That is as far as I understand it, from a non-tech spiritual point of view ...

    ANSI uses one fixed byte to represent each character:
    one single byte, or 8 bits = a maximum of 256 characters.

    vs

    UTF-8 is a multi-byte encoding scheme, using 1 to 4 bytes to represent a single character.
    Hence, when you store characters that fall outside the 256-character ANSI code page, they are LOST (replaced with ? placeholders), just as the Microsoft warning says when you save a Notepad window containing UTF-8 characters as an ANSI file.

    Unicode, by contrast, defines 1,112,064 possible code points (characters, control codes, etc.), all of which can be fully represented within UTF-8's 1 to 4 bytes.

    So please explain to me, very briefly, how ANSI recovers the lost bytes
    when converting a one-byte character back into a 1-to-4-byte one.
    Maybe there is some unknown, newly created Microsoft magic??

    Could it be that YOU only get apparently correct output because some hidden system memory still has the original UTF-8 stored??

    I really want to be sure that nothing is getting lost, as you are the first in the world to say so.
    Why else would all internationalized file systems and data systems be UTF-8, which of course results in MUCH larger file sizes for international texts,
    if all the SAME could be done with 256 single-byte characters only??

    Storing UTF-8 as ANSI CUTS off and trashes anything beyond the code page's single byte,
    lost for all eternity,
    in my understanding as a NON-coder.
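    The rabbit argument can be checked directly. A minimal sketch, assuming Windows-1252 as the "ANSI" code page (Python's codecs make the loss visible):

```python
anchor = "Дикша в Крийя Йогу"  # the ru anchor from the URLs above

# UTF-8 round-trips losslessly:
assert anchor.encode("utf-8").decode("utf-8") == anchor

# Windows-1252 has no Cyrillic: every such character is replaced
# with '?', and the original bytes are gone for good.
ansi = anchor.encode("cp1252", errors="replace")
print(ansi.decode("cp1252"))  # -> ????? ? ????? ????
```

    Once the text is stored that way, no amount of re-decoding can bring the original characters back; only an untouched UTF-8 copy can.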

    More than a decade ago, we had to choose a code page when configuring a server or desktop to meet the needs of a single language, thereby excluding all other languages.

    That time is gone, thanks to strictly using UTF-8 all over the system.
    All we then need are the fonts for the languages used locally, plus the language support on the servers,
    which on typical Linux servers always includes ALL global languages by default.

    Maybe some clarification is needed to have peace of mind before following your advice.

    MAYBE here you can find a more detailed, more technical explanation, better suited for professional coders:

    http://www.differencebetween.net/technology/protocols-formats/difference-between-ansi-and-utf-8/

    I fully agree with the final summary by the author of the page above.

  • SvenSven www.GSA-Online.de

    No need to teach me UTF-8 or anything about it. The reason why I still use non-Unicode in most of the application is that it's faster. Imagine the log, e.g.: if you have to parse it for Unicode, it takes time; ANSI is post-and-forget.

    I only add Unicode support where I see a useful benefit from it. OK, "Edit All" might qualify here, as it is indeed ugly, but nothing to write a "book" about ;) I will put it on the todo list though.

  • @sven
    I see your point about file sizes.
    My first office PC in 1985 had a floppy with some 169 kB or so, an 8-bit CPU at 5 MHz (and all commercial SW in ultra-fast assembler), and the OS was the never-crashing CP/M.

    But with today's high-speed CPUs / HDDs / RAM,
    in the commercial IT world that no longer appears to be a criterion, as speed compensates by far for file size.

    Processing time is won or lost in the programming language and, above all, in desktop environments and GUIs.

    How else could mega-companies like Google make the UTF-8 choice, if overall performance were worse than ANSI?

    And in an international world, the advantages of UTF-8 by default everywhere far outweigh the file size.
    See your other customer with his Greek character problems,
    and see all the missing potential customers from non-ANSI countries.

    And
    even Google appears to have problems with non-UTF-8 stuff.

    A test G search: compare the Google search results for the 2 footprints below
    • inurl:"특수기능:로그인" wiki
    • and the same from the non-UTF-8 world:
    • inurl:"%ED%8A%B9%EC%88%98%EA%B8%B0%EB%8A%A5:%EB%A1%9C%EA%B7%B8%EC%9D%B8" wiki
    the second footprint is from SER;
    the first is from today's work converting URL-encoded footprints back into the regular original characters, for my offline URL filtering of SB-harvested target sites.
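    Incidentally, the conversion described here (percent-encoded footprint back to the original characters) is a one-liner in Python's standard library; the footprint below is the one quoted in this post:

```python
from urllib.parse import unquote

footprint = '%ED%8A%B9%EC%88%98%EA%B8%B0%EB%8A%A5:%EB%A1%9C%EA%B7%B8%EC%9D%B8'
print(unquote(footprint))  # -> 특수기능:로그인
```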

    whatever;
    have a nice day
