Continuation of the post, originally made in the Python-Brasil list:

I’ll try again, as the thread has already discussed three different things:

  1. Encoding to use in Python source files: why UTF-8 is highly recommended
  2. Encodings in general, and the problems they cause and solve
  3. A bug in Python on Windows, when the prompt is set to code page 65001

I’ll try to explain for everyone, as it’s a recurring topic.

But before getting back to these topics, we have to go back to files.

Attempt 2:

To understand character encoding, you need to understand what it is about. Anyone who programs knows the ASCII code, which maps each character of the Latin alphabet, plus control codes and some symbols, into 7 bits. Being 7 bits, it goes from 0 to 127. This worked well in the 1960s… when every bit of transmission and storage was precious, long before torrents and the like :-D. Internationalization wasn’t yet a concern, by the way. ASCII stands for American Standard Code for Information Interchange. American as in United States.

Our computers today use 8 bits per byte, but it wasn’t always like that. IBM, and later Microsoft, took advantage of the extra bit to complete 8 bits and added 128 more characters, since each additional bit doubles the representation capacity (powers of 2, etc.). These extra characters were used to represent box-drawing symbols, extra signs and some accented letters. Anyone who used DOS remembers this.

As 256 symbols aren’t enough to represent the characters of all languages, IBM and others created country- or language-specific code pages. Code page 437 (cp437), for example, contained drawing symbols and accented characters used in languages like French and Spanish, meeting the needs of North America and some European languages. Portuguese was not fully covered: page 437 has neither ã nor õ. This problem was solved with page 850, which swaps some drawing symbols and rarely used signs for accented letters from various Western European languages.

After a lot of history, let’s get back to how it affects our bytes.

In the ASCII code, the uppercase letter A is represented by the number 65 in decimal or 0x41 in hexadecimal. The B is the next letter, so they gave it the number 66 or 0x42.

If you have a file with only the two letters AB, one after the other, it will occupy two bytes on disk. The binary content of the file on disk is the sequence of bytes 0x41 and 0x42 (AB, or 65 and 66). It’s very important to understand this encoding before reading on. If you don’t understand that an A is stored as the number 65, forget UTF-8 for now… reread this part, or ask a friend. In the old days, computer science courses started with the binary system and the ASCII table; now the first program downloads web pages… but the basic theory is left behind.
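We can check this in modern Python 3 (the post’s examples are Python 2, but the bytes are the same):

```python
# Encoding "AB" as ASCII produces exactly the two bytes stored on disk.
data = "AB".encode("ascii")

print(list(data))                 # [65, 66] -- the numeric value of each byte
print(hex(data[0]), hex(data[1])) # 0x41 0x42 -- the same values in hexadecimal
```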

Both page 437 and page 850 preserve the entire ASCII table, that is, its first 128 characters. Thus, our AB file is shown the same way under both pages. The difference only appears when we use characters on which the two pages disagree.

Now imagine that we add a Ã, using page 850, because we write on a computer configured for Portuguese: ABÃ. In code page 850, the Ã is mapped to code 199, or 0xC7 in hexadecimal, so on disk we would have 3 bytes: 0x41 0x42 0xC7

Now imagine that we send this file to an American friend who opens it on a computer using page 437. The content on disk remains the same, 0x41 0x42 0xC7, but what he sees on screen is: AB╟ Where did our Ã go? In page 437, byte 0xC7 is a box-drawing symbol; the Ã only appears if the file is read with the same code page we used to write it, page 850.

With only 3 bytes we can already see the problem… now imagine entire files!
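This round trip can be reproduced in Python 3 (note that in code page 850 the byte for Ã is 0xC7):

```python
# The same 3 bytes on disk...
data = bytes([0x41, 0x42, 0xC7])

# ...read with the code page used to write the file (850):
print(data.decode("cp850"))   # ABÃ

# ...read on the friend's machine with code page 437:
print(data.decode("cp437"))   # AB╟  -- a box-drawing symbol instead of Ã
```

Same bytes, two different texts: the meaning lives in the code page, not in the file.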

As computers spread around the world, several pages were created for Russian, Greek, etc. Imagine then writing a file with parts in Greek, Russian and Chinese… chaos!

What a character encoding table does is map a numerical value to a graphical symbol or character. You choose the table you want to use, but to translate between tables, you need to know which table was used to encode the current data and which table you want to translate it to.

Another problem is that languages like Chinese need more than 256 symbols to represent normal text, since their writing system has far more characters than our alphabet. So multi-byte tables were created, where more than one byte is used per character. Even so, you still needed to know which multi-byte table was in use… repeating the confusion. Anyone who has worked with Windows in C++ using MBCS knows the pain that causes…
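A quick sketch of a multi-byte table in Python 3, using GBK (a Chinese code page) purely as an example of such a table:

```python
# One Chinese character occupies two bytes in the multi-byte GBK table.
ch = "中"
encoded = ch.encode("gbk")

print(len(ch), len(encoded))           # 1 character, 2 bytes
print(encoded.decode("gbk"))           # 中 -- only if we decode with the same table
```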

One of the solutions for supporting multiple languages was to create a single table that solves all these problems: thus Unicode was created, so that all languages could be represented. The catch is that, to hold all those symbols, several bytes have to be used even for Latin characters.

Thus, each letter, in a simplification, would be represented by 2 bytes (a simplification because 2 bytes are not enough: Unicode has more than 65536 characters!). Continuing, our A in Unicode is represented as 0x00 0x41 and the B as 0x00 0x42. Each letter is now represented by two bytes, and one of them is the feared 0x00! The Ã sits at position 0x00 0xC3. On disk:

0x00 0x41 0x00 0x42 0x00 0xC3

Now we use 6 bytes for 3 letters. Storage problems immediately emerged, since files doubled in size and took twice as long to transmit… in theory. So a more compact way of representing these characters was developed: UTF-8.
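The two-bytes-per-character scheme described above corresponds to what Python calls utf-16-be (big-endian, without a byte-order mark); a quick check in Python 3:

```python
# Each character becomes exactly two bytes, high byte first.
data = "ABÃ".encode("utf-16-be")

print(data.hex(" "))   # 00 41 00 42 00 c3
print(len(data))       # 6 -- six bytes for three letters
```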

Using the same Unicode table as a base, but with a variable-length scheme in which characters beyond ASCII become multi-byte sequences, ABÃ in UTF-8 is written on disk as:

0x41 0x42 0xC3 0x83

We passed from 6 to 4 bytes without losing the ability to write in almost any language!
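The same comparison in Python 3:

```python
# UTF-8: ASCII letters stay one byte; Ã becomes the two-byte sequence 0xC3 0x83.
data = "ABÃ".encode("utf-8")

print(data.hex(" "))          # 41 42 c3 83
print(len(data))              # 4 -- down from 6 bytes
print(data.decode("utf-8"))   # ABÃ -- nothing was lost
```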

Now, what happens with Python? A .py file with a single line to print ABÃ can be written as:

print "ABÃ"

On disk, if we use a UTF-8 editor to write it, it will be saved as:

0x70 0x72 0x69 0x6E 0x74 0x20 0x22 0x41 0x42 0xC3 0x83 0x22 0x0D 0x0A

These bytes are what python.exe will read. Decoded as UTF-8, they translate to:

p r i n t " A B Ã "

But the interpreter doesn’t know this! It complains: Non-ASCII character '\xc3' (\xc3 is just another way of writing 0xC3).

Why? In Python 2, a source file was read as ASCII by default, which has only the symbols from 0 to 127. 0xC3 is 195 in decimal, outside the ASCII table, so there’s an error.
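We can reproduce the interpreter’s complaint by decoding those source bytes as ASCII ourselves (Python 3 syntax):

```python
# The bytes of our one-line program, as saved by a UTF-8 editor.
source = b'print "AB\xc3\x83"'

try:
    source.decode("ascii")   # how Python 2 read source files by default
except UnicodeDecodeError as e:
    print(e)                 # 'ascii' codec can't decode byte 0xc3 ...
```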

To fix this, we add the header # coding: utf-8 and mark the string as unicode with the u prefix. The program becomes:

# coding: utf-8
print u"ABÃ"

On disk:

0x23 0x20 0x63 0x6F 0x64 0x69 0x6E 0x67 0x3A 0x20 0x75 0x74 0x66 0x2D
0x38 0x0D 0x0A                                             # coding: utf-8
0x70 0x72 0x69 0x6E 0x74 0x20 0x75 0x22 0x41 0x42 0xC3 0x83 0x22 0x0D 0x0A    print u"ABÃ"

The only difference in the print line is the letter u before the quotes, but the result is different:

C:\Users\nilo>\Python27\python.exe Desktop\test.py
ABÃ

It came out correctly! Why? Because Python now knows the string is unicode and that console output on my Windows uses cp850, so it converts the bytes during printing and they are presented correctly.
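What Python does during that print can be sketched by hand in Python 3: the unicode string is encoded to the console’s code page before being written out.

```python
text = "ABÃ"                       # a unicode string (u"ABÃ" in Python 2)

# Encoding to the console's code page (cp850 in the example above):
console_bytes = text.encode("cp850")
print(console_bytes)               # b'AB\xc7' -- what actually reaches a cp850 console
```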

That’s why it’s essential to understand the encoding of your file and the encoding of the console, database, etc. You need to help the program behave well.

Now let’s go back to the error with an invalid header, where we declare UTF-8 but our editor saves the file using Windows cp1252. Visually the file has the same content:

# coding: utf-8
print u"ABÃ"

But on disk:

0x23 0x20 0x63 0x6F 0x64 0x69 0x6E 0x67 0x3A 0x20 0x75 0x74 0x66 0x2D
0x38 0x0D 0x0A                                             # coding: utf-8
0x70 0x72 0x69 0x6E 0x74 0x20 0x75 0x22 0x41 0x42 0xC3 0x22 0x0D 0x0A         print u"ABÃ"

It results in:

C:\Users\nilo>\Python27\python.exe Desktop\test.py
  File "Desktop\test.py", line 2
    print u"ABÃ"      
SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data  

Why? If you compare the second line in hexadecimal with the example above, you’ll see that in cp1252 the Ã is translated to a single byte, 0xC3. But we declared UTF-8! The Python interpreter is a program and trusts what we declare. It reads the file as if it were UTF-8 and finds the 0xC3, which there isn’t a whole character but the lead byte of a multi-byte sequence. After reading 0xC3 it expects the continuation byte of that sequence, but finds the quote (0x22). 0xC3 0x22 is an invalid sequence in UTF-8, and the interpreter blows up with a codec exception.
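The same clash can be shown directly in Python 3: the byte 0xC3 is a complete character in cp1252, but an incomplete sequence in UTF-8.

```python
# cp1252 bytes for 'ABÃ"': the Ã is the single byte 0xC3...
data = b'AB\xc3"'

print(data.decode("cp1252"))   # ABÃ" -- perfectly fine as cp1252

# ...but as UTF-8, 0xC3 is a lead byte and the quote 0x22 can't continue it:
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)                   # invalid continuation byte
```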

Going back to the beginning of the topic:

  1. Encoding to use in Python source files: why UTF-8 is highly recommended. Because you can send your programs to other computers (Linux, Mac, Windows) and still use accents, avoiding future problems. But it only works if your header declares the real encoding used in the file; otherwise it doesn’t work.

  2. Encodings in general, and the problems they cause and solve. I think this is answered at the beginning of the message.

  3. A bug in Python on Windows, when the prompt is set to code page 65001. Besides IBM’s pages, Microsoft also has its own; among them are cp1252, and cp65001 for UTF-8. If you configure, and only if you configure, your console to use page 65001 (UTF-8), the result is the following:

C:\Users\nilo>chcp 65001
Active code page: 65001       
C:\Users\nilo>\Python27\python.exe Desktop\test.py
Traceback (most recent call last):
  File "Desktop\test.py", line 2, in <module>
    print u"ABÃ"
LookupError: unknown encoding: cp65001
C:\Users\nilo>\Python32\python.exe Desktop\test.py      
Fatal Python error: Py_Initialize: can't initialize sys standard streams      
LookupError: unknown encoding: cp65001
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

It is only in this specific case, unnecessary for Portuguese, that we hit a bug still open as of Python 3.3.

It’s not a Windows bug, as the same setup works in Java, C# and C. It’s just that the interpreter treats cp65001 as a different, unknown encoding rather than as utf-8.