A post from a fellow blogger on the Python-Brasil mailing list:

Others have already covered why UTF-8.

I just want to point out that the subject is more complicated than it seems. For example, in Python 2.7:

# -*- coding: utf-8 -*-
print "Accents: áéíóúçãõ"    # byte string
print u"Accents2: áéíóúçãõ"  # unicode string

Run the program above on Windows, either through IDLE or the console:

C:\Users\nilo\Desktop>\Python27\python.exe test.py
Accents: ├í├®├¡├│├║├º├ú├Á
Accents2: áéíóúçãõ

You should get correct results only on the Accents2 line. A string not marked as unicode is simply printed as a sequence of bytes, with no translation. With the u prefix, as in Accents2, Python knows it must translate from Unicode to the console encoding (cp850, in the case of my console here at home). On Linux, by contrast, both lines come out correct!
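
To make this concrete, here is a minimal sketch of what happens in each case (Python 2.7; cp850 is just the encoding my Windows console happens to use):

# -*- coding: utf-8 -*-
# Python 2.7 sketch of what print does with each kind of string.
import sys

s = "áéíóúçãõ"    # byte string: the raw UTF-8 bytes of this source file
u = u"áéíóúçãõ"   # unicode string: characters, no fixed byte form yet

print sys.stdout.encoding   # cp850 on my Windows console, UTF-8 on Linux
print s                     # bytes written as-is; a cp850 console garbles them
print u.encode(sys.stdout.encoding or "utf-8")   # roughly what print u does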

The coding: utf-8 line declares only the encoding of the source code; that is, it is just a hint about how the characters in the file are encoded. For it to work, your text editor must actually be configured to save in UTF-8 as well. If the two disagree, disaster is guaranteed! On Windows I recommend PSPad for editing in UTF-8. To check the encoding of an unknown file, or to confirm which encoding your editor really used, use a binary viewer such as HxD [3]. Note that PSPad's hex editor displays characters as Unicode even when the file is encoded in UTF-8; a reminder that UTF-8 is a representation, one way of encoding Unicode characters. Notepad++ can also edit and save files in UTF-8.

On Mac and Linux, try hexdump -C file

When the file is correctly encoded in UTF-8, accented characters take up more than one byte each:

For example, here is the program above, created with vim on Ubuntu:

nilo@linuxvm:~$ hexdump -C test.py
00000000  23 20 2d 2a 2d 20 63 6f  64 69 6e 67 3a 20 75 74  |# -*- coding: ut|
00000010  66 2d 38 20 2d 2a 2d 0a  70 72 69 6e 74 20 22 41  |f-8 -*-.print "A|
00000020  63 63 65 6e 74 73 3a 20  c3 a1 c3 a9 c3 ad c3 b3  |ccents: ........|
00000030  c3 ba c3 a7 c3 a3 c3 b5  22 0a 70 72 69 6e 74 20  |........".print |
00000040  75 22 41 63 63 65 6e 74  73 32 3a 20 c3 a1 c3 a9  |u"Accents2: ....|
00000050  c3 ad c3 b3 c3 ba c3 a7  c3 a3 c3 b5 22 0a        |............".|
0000005e
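
If hexdump is not at hand (on Windows, for instance), a few lines of Python can show the same bytes; a minimal sketch:

# Works on Python 2 and 3: dump a file's raw bytes in hex, like hexdump.
import binascii

with open("test.py", "rb") as f:   # "rb": read raw bytes, no decoding
    data = f.read()

# UTF-8 accented characters show up as two-byte c3 xx pairs.
print(binascii.hexlify(data))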

A great site for looking up these byte sequences is this one: http://www.utf8-chartable.de/

Once the source-encoding problem is solved, there remain:

  • Console encoding
  • Data file encoding
  • Database encoding

Both Mac and Linux use UTF-8 by default. Windows uses cp1252 (in the GUI), which is largely compatible with iso8859_1. Be careful when exchanging files between Windows, Linux, and Mac machines, and never mix two encodings in the same file: that produces errors that are hard to detect and fix.

It is easy to end up with mixed encodings when appending to a file that came from another machine or was generated by another program.
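
The safest habit is to never rely on the platform default and always name the encoding when reading or writing text. A minimal sketch (the io module behaves this way on both Python 2.6+ and Python 3; data.txt is just a hypothetical file name):

# -*- coding: utf-8 -*-
# State the encoding explicitly instead of trusting the OS default.
import io

with io.open("data.txt", "w", encoding="utf-8") as f:
    f.write(u"áéíóúçãõ\n")   # stored as the same UTF-8 bytes on any platform

with io.open("data.txt", "r", encoding="utf-8") as f:
    print(f.read())          # decoded back to the same characters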

Windows in Chinese, Russian, and other languages does not use cp1252! UTF-8 is therefore a good choice, since it can encode any Unicode character, using one or more bytes as needed.
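
That variable width is easy to see (Python 3 here; the characters are arbitrary examples):

# Python 3: UTF-8 spends only as many bytes as each character needs.
for ch in ["A", "é", "中"]:
    print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
# A -> 1 byte(s), é -> 2 byte(s), 中 -> 3 byte(s); some emoji take 4.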

Python 3 resolves many of these problems, but the documentation says [1]:

Files opened as text files (still the default mode for open()) always use an encoding to map between strings (in memory) and bytes (on disk). Binary files (opened with a b in the mode argument) always use bytes in memory. This means that if a file is opened using an incorrect mode or encoding, I/O will likely fail loudly, instead of silently producing incorrect data. It also means that even Unix users will have to specify the correct mode (text or binary) when opening a file. There is a platform-dependent default encoding, which on Unixy platforms can be set with the LANG environment variable (and sometimes also with some other platform-specific locale-related environment variables). In many cases, but not all, the system default is UTF-8; you should never count on this default.

The part I underlined says: “… the system default is UTF-8; you should never count on this default…”
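
In practice, that means telling open() the mode and the encoding yourself instead of counting on the default; a minimal sketch in Python 3:

# Python 3: be explicit about text vs binary and about the encoding.
import locale

print(locale.getpreferredencoding())   # the default Python would use

with open("test.py", encoding="utf-8") as f:   # text mode, explicit encoding
    text = f.read()    # str: characters in memory

with open("test.py", "rb") as f:   # binary mode
    raw = f.read()     # bytes: no decoding at all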

In short, it is a subject worth studying, because it causes those “magical” problems that keep turning up.

A text explaining everything in detail can be found in [2].

Nilo Menezes

[1] http://docs.python.org/release/3.0.1/whatsnew/3.0.html
[2] http://wiki.python.org.br/TudoSobrePythoneUnicode
[3] http://mh-nexus.de/en/hxd/