Dr Unicode-Decode-Love

or how I stopped worrying and learned to love Unicode

Unicode is a pretty impressive specification. Its original aim was to unify all the character sets that were out there since the dawn of the computer, and it was not an easy task.

I won't dig much into the history of character sets, as there's Wikipedia for that. I won't even try to describe the Unicode standard in detail: the standard itself is in fact one of the bigger books I've ever set my eyes upon.

I mean physically big. It's a scary standard.

However, there are a few things about Unicode worth knowing:

  1. Unicode in itself does not mandate a binary representation of characters, but merely assigns a number (called a codepoint) to each character, to differentiate between them and for ordering purposes
  2. It delegates serialization (meaning the translation of characters into actual sequences of 1s and 0s) to the Unicode Transformation Formats (UTF-*)
  3. The transformation formats are various, and mutually incompatible (you can't read a UTF-16 file as UTF-8)
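To make the difference between a codepoint and its serializations concrete, here is a minimal sketch. It is written so that it should run unchanged on Python 2.7 and 3.x; the variable names are mine, and the "-le" codec variants are chosen only to keep the BOM out of the picture.

```python
# The character is written as an escape, so the encoding of this source
# file does not matter.
char = u"\xe4"  # LATIN SMALL LETTER A WITH DIAERESIS

# The codepoint is just a number; no bytes are involved yet.
codepoint = ord(char)

# Each transformation format turns that number into different bytes.
as_utf8 = char.encode("utf-8")
as_utf16 = char.encode("utf-16-le")
as_utf32 = char.encode("utf-32-le")

# The formats are mutually incompatible: UTF-16 bytes are not valid UTF-8.
try:
    wrong = as_utf16.decode("utf-8")
except UnicodeDecodeError:
    wrong = None
```

The same codepoint takes two bytes in UTF-8, two (different) bytes in UTF-16 and four in UTF-32.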

The most common serialization of Unicode is UTF-8, which has two main advantages over the other UTF-* formats:

  1. It is backwards-compatible with 7-bit (yes, 7) ASCII, meaning that you can load a 7-bit ASCII file as UTF-8 without having errors crop up.
  2. It is more robust than the others (e.g. UTF-16): it works byte by byte, so it does not suffer from endianness problems (and the whole BOM mess of UTF-16)

In order to be compatible with ASCII, it uses a clever variable-length encoding scheme based on a marking system. In the original design a character might expand to up to 6 bytes (which made UTF-8 able to represent 2^31 values); the current standard stops at 4 bytes, which is enough for every Unicode codepoint. Whenever a decoder runs through a bytestream, it is able, by looking at the leftmost bits of the first byte, to know how many bytes actually make up the character.
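The marking system can be sketched in a few lines. This is only an illustration of how the leading bits announce the sequence length, under the modern 4-byte limit; utf8_sequence_length is a hypothetical helper of mine, not something from the standard library, and it does not perform the full validation a real decoder does.

```python
def utf8_sequence_length(first_byte):
    """Return the length of a UTF-8 sequence, given its first byte as an int."""
    if first_byte < 0x80:            # 0xxxxxxx: plain ASCII, one byte
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: one continuation byte follows
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: two continuation bytes follow
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: three continuation bytes follow
        return 4
    # 10xxxxxx is a continuation byte and can never start a character.
    raise ValueError("not a UTF-8 lead byte: 0x%02x" % first_byte)


# bytearray yields ints on both Python 2 and 3.
lead = bytearray(u"\xe4".encode("utf-8"))[0]
length = utf8_sequence_length(lead)
```

Since every ASCII byte starts with a 0 bit, an ASCII file is already a valid UTF-8 file in which every character is one byte long.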

Fun, but how do I use it in Python?

I'll assume, here, that we are using the 2.x version of Python. The main difference between 2.x and 3.x, regarding Unicode, is that whenever in 3.x you do something like:

a = "hej där"

a is not a bytestream (aka, a sequence of bytes in the source file's encoding) but rather a full unicode object, or, in 2.x idiom:

a = u"hej där"

From this moment onwards, however, I will focus on Python 2.x, therefore whenever I say "Python" I mean "Python 2.x".

The unicode object

In Python, a character is a byte, and strings are nothing but a series of "meaningless" bytes. In fact, all the operations that read from and write to streams use strings, and those strings are not necessarily intended to be human-readable (e.g. a PNG image).

In contrast, unicode strings and objects (the strings with u in front) are used exclusively to store human-readable text. But unicode objects are not strings, and while they can be expressed as such, you have to do that explicitly. You can't send a unicode object over a network socket any more than you can send a MyClass object straight through it: you have to serialize it first. So, to put it in a non-standard and probably incorrect way, unicode objects are "complex" objects that need a serialization step in order to be transformed to and from sequences of bytes (aka strings).

So, let's for a while imagine that you have your object of type MyClass and want to serialize/deserialize it, for example to write it into a file. What would you do?

You would probably use one of the serialization methods that are available, for example pickle. And you would take care to always deserialize objects before manipulating them, and to serialize them again before sending them to storage or over the network.

You should do exactly the same thing with unicode strings.

Unicode objects can be serialized in many ways, some lossless and some lossy.

The lossless methods are:

  • Pickling
  • JSON (which is encoding-independent)
  • UTF-8
  • UTF-16
  • UTF-32

The lossy methods are:

  • ASCII
  • ISO-8859-*
  • Whatever other character encoding you can think of

To serialize a unicode object you use the encode() method:

>>> u"Hej där".encode("utf-8")
'Hej d\xc3\xa4r'
>>> u"Hej där".encode("utf-16")
'\xff\xfeH\x00e\x00j\x00 \x00d\x00\xe4\x00r\x00'
>>> u"Hej där".encode("utf-32")
'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00j\x00\x00\x00 \x00\x00\x00d\x00\x00\x00\xe4\x00\x00\x00r\x00\x00\x00'
>>> u"Hej där".encode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 5: ordinal not in range(128)
>>> u"Hej där".encode("iso-8859-15")
'Hej d\xe4r'

Notice how ascii turned up an error, because it cannot encode ä. If you are willing to lose information, you can pass an optional parameter that tells the encoder how to behave in such cases. For example, I can tell it to ignore characters it cannot serialize:

>>> u"Hej där".encode("ascii", "ignore")
'Hej dr'

The result doesn't quite mean the same thing, though: "Hej där" (Swedish for "Hello there") has turned into "Hej dr" ("Hello doctor"), which works only in limited cases. You can also notice how the serialization differs from one encoding to the other: utf-8 used two bytes for the special character, iso-8859-15 used one byte, while utf-16 and utf-32 used respectively two and four bytes for all the characters (including the ones that are in ASCII too).
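Besides "ignore", the encoder accepts a few other standard error handlers, and the size differences are easy to measure. A small sketch, with variable names of my own choosing and the "-le" codec variants used so that the BOM does not inflate the counts:

```python
text = u"Hej d\xe4r"  # 7 characters

ignored = text.encode("ascii", "ignore")            # drops the character
replaced = text.encode("ascii", "replace")          # puts a "?" in its place
as_xml = text.encode("ascii", "xmlcharrefreplace")  # numeric character reference
escaped = text.encode("ascii", "backslashreplace")  # Python-style escape

# Number of bytes each encoding needs for the same 7 characters.
sizes = {codec: len(text.encode(codec))
         for codec in ("iso-8859-15", "utf-8", "utf-16-le", "utf-32-le")}
```

All four handlers are lossy in the sense that decoding the result as ASCII will not give you back the original text, although the last two at least leave a trace of what was there.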

Deserializing works by using the decode() method of strings.

It is very important that you use the same method to deserialize that was used to serialize.

In fact, just as you wouldn't try to unpickle an object that you have serialized through JSON, you shouldn't try to decode with the wrong method (although it might work in limited cases, thanks to the partial compatibility between some encodings):

>>> 'Hej d\xc3\xa4r'.decode("utf-8")
u'Hej d\xe4r'
>>> '\xff\xfeH\x00e\x00j\x00 \x00d\x00\xe4\x00r\x00'.decode("utf-16")
u'Hej d\xe4r'
>>> '\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00j\x00\x00\x00 \x00\x00\x00d\x00\x00\x00\xe4\x00\x00\x00r\x00\x00\x00'.decode("utf-32")
u'Hej d\xe4r'
>>> 'Hej d\xe4r'.decode("iso-8859-15")
u'Hej d\xe4r'
>>> 'Hej d\xe4r'.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 5: unexpected end of data

You can see how, in the last example, the decoder complains about the utf-8 stream ending abruptly: the byte 0xe4 looks like the start of a multi-byte sequence, so the decoder was expecting two more bytes after it, not just one!
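decode() takes the same optional error-handling parameter as encode(). The sketch below (my own variable names) shows that "replace" lets you get something out of a stream decoded with the wrong codec, but what you get is a placeholder, not the original character. The exact shape of the salvaged text may vary between Python versions, so I make no claim about it here.

```python
latin_bytes = u"Hej d\xe4r".encode("iso-8859-15")

# Strict decoding (the default) with the wrong codec raises.
try:
    latin_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" never raises: undecodable bytes become U+FFFD,
# the Unicode replacement character.
salvaged = latin_bytes.decode("utf-8", "replace")

# The right codec gives the text back intact.
correct = latin_bytes.decode("iso-8859-15")
```

So "replace" is a way to keep a program running, not a way to recover the text.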

So what should I do?

Ideally, you should:

  1. Keep all the text that is supposed to be human-readable within your application as unicode objects, not strings
  2. Encode unicode objects appropriately before sending them over the network or writing them to a file (most RDBMS APIs accept unicode objects as input, so you shouldn't worry about that)
  3. Decode each stream that comes in, using the appropriate method: this means you should make sure you know what method was used to encode the incoming stream (Content-Type headers, XML declarations, meta tags, etc... most sane formats have a way to handle this)
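The three rules above can be sketched as a single function: decode at the border, work on text, encode on the way out. shout is a hypothetical helper, and the default codecs are assumptions I picked for the example.

```python
def shout(raw_bytes, incoming="iso-8859-15", outgoing="utf-8"):
    """Decode incoming bytes, manipulate them as text, then encode the result."""
    text = raw_bytes.decode(incoming)   # rule 3: bytes in, unicode object out
    text = text.upper()                 # rule 1: only ever manipulate text
    return text.encode(outgoing)        # rule 2: encode just before sending


result = shout(b"Hej d\xe4r")
```

Because the uppercasing happens on the unicode object, the accented letter is handled as a letter and not as an opaque byte.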

The silver lining is:

Most likely, your framework of choice handles 2 and 3 for you.

In case it doesn't, the first thing you have to determine is how the stream was encoded (utf-8? utf-16? hahaha-i-invented-a-charset-for-fun?) and then act upon it using decode(). Do not manipulate encoded strings: it is as bad as meddling with whatever pickle spews out, or using a magnetized needle to edit files.
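For streams that come with a Content-Type header, the standard library's email package can pull the charset out for you. A minimal sketch: charset_from_content_type is a hypothetical helper, and falling back to utf-8 when no charset is declared is purely my assumption, so pick whatever default suits your data.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset declared in a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


body = b"Hej d\xe4r"
charset = charset_from_content_type("text/plain; charset=ISO-8859-15")
text = body.decode(charset)
```

From here on, text is a unicode object and can be manipulated safely.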

Also, keep in mind that unicode(string_of_bytes) is exactly like doing string_of_bytes.decode(sys.getdefaultencoding()), only way less obvious. So be easy on your brain and avoid it.

Known annoyances

Some frameworks sometimes do not behave very rationally (although there are probably good reasons for that, but I don't know them), and this causes some problems.

  • Archetypes: some methods will insist on returning utf-8 encoded data. If you then encode that again, you'll end up with double-encoded data (Zope page templates are very lenient about this, so it is usually not a problem, but if you try passing that to sqlite you might end up with double-encoded data). So you might want to decode those values before passing them over to other, non-Plone APIs.
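One defensive way to deal with APIs like this is to normalize whatever they return before using it. to_unicode is a hypothetical helper of mine, and it assumes the framework hands back utf-8 when it hands back bytes.

```python
def to_unicode(value, encoding="utf-8"):
    """Decode byte strings, and pass unicode objects through untouched."""
    # On Python 2.6+ `bytes` is simply another name for `str`.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


from_framework = b"Hej d\xc3\xa4r"      # already utf-8 encoded
already_text = u"Hej d\xe4r"            # a proper unicode object

# Both end up as the same unicode object, which can then be encoded once.
normalized = [to_unicode(from_framework), to_unicode(already_text)]
```

This way the value gets encoded exactly once, at the point where it leaves your code.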

Dr. Unicode's anatomy

A simple symptom-to-illness mapping for you men of medicine out there.

Accented letters don't show up, but instead some strange characters pop up, two for each accented/special character

Pipelined encoding and decoding with someone getting the wrong codec at some point. Probably something like this happens:

>>> print u"Hej där".encode("utf-8").decode("latin1").encode("utf-8")
Hej dÃ¤r
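If the damage is exactly the one shown above (utf-8 bytes wrongly decoded as latin1), it can be undone by walking the pipeline backwards. This is a sketch for that specific case only; it will not help if a lossy step was involved somewhere along the way.

```python
# Reproduce the damage: utf-8 bytes read as if they were latin1.
garbled = u"Hej d\xe4r".encode("utf-8").decode("latin1")

# Undo it: go back to the original bytes, then decode them properly.
repaired = garbled.encode("latin1").decode("utf-8")
```

The garbled version is one character longer than the original, which matches the symptom: two strange characters for each accented one.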

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)

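Most likely you are trying to encode something that is already encoded. This one is tricky, because the message talks about the 'ascii' codec and a decode error even though your code asked for neither. It might actually be something like this: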

>>> print u"Hej där".encode("utf-8").encode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
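The reason 'ascii' shows up is that Python 2 can only encode unicode objects, so when encode() is called on a byte string it first decodes it behind your back with the default codec, which is normally ascii. The sketch below spells out that hidden step explicitly; the explicit form fails the same way on Python 3, where calling encode() on bytes is not possible in the first place.

```python
encoded = u"Hej d\xe4r".encode("utf-8")   # already a byte string

# What Python 2 does implicitly before it can encode again:
try:
    encoded.decode("ascii")
    hidden_step_failed = False
    message = ""
except UnicodeDecodeError as exc:
    hidden_step_failed = True
    message = str(exc)
```

The fix is not to change the default codec, but to make sure encode() is only ever called on unicode objects.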

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 5: ordinal not in range(128)

A lossy method is being used, but no loss is expected (in short, there are some characters that can't be serialized). Most likely something like this:

>>> print u"Hej där".encode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 5: ordinal not in range(128)

Errata corrige

I expect this post to be full of factual errors. Kudos to everyone who helps me make it better! Also, if you have some errors and solutions you want to share, please comment or send me a mail at simone.deponti (character encodable in ascii that looks like a funny a) abstract.it
