It has been quite some time since the last post, but that doesn’t mean I haven’t done anything interesting :). It’s just that things were so interesting that I didn’t have any time to write about them.
Anyway, this time I have stumbled (again) upon a unicode problem in some Python code which was supposed to be perfectly prepared for it, since it even started with
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
It went perfectly fine when running in Eclipse, but to my huge surprise I got errors when running the unit tests from the command line in a terminal. Whaaat? It just worked!
Well, declaring your source as UTF-8 is not enough, of course. There are several things to check when you get the “UnicodeDecodeError: ‘ascii’ codec can’t decode byte … in position …: ordinal not in range(128)” kind of errors. Googling around didn’t bring me much luck, to my surprise, so here are my findings for the next time :).
First of all, make absolutely sure you haven’t forgotten the ‘u’ prefix before the strings that contain non-ASCII characters. Yep, just like that you can screw up the rest of the unicode support. Python (OK, I admit, I use 2.5.4) treats a plain ‘string’ literal as a regular byte string and not as unicode. So write u’string’ instead!
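A minimal sketch of the difference (the string itself is made up; the u’’ literal works in Python 2 as well as in Python 3.3+, where it was re-allowed):

```python
# -*- coding: UTF-8 -*-
# u'...' is a unicode (text) literal; encoding it yields bytes.
text = u'héllo wörld'

# Serialize to UTF-8 bytes and back; the round trip is lossless.
data = text.encode('utf-8')
restored = data.decode('utf-8')

assert restored == text
```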
Second, when doing file operations, don’t forget that you don’t get unicode by default. Consider the following:
message = u'unicode message'
file_handle.write(message)
Well, guess what: you get a UnicodeEncodeError as soon as the message contains non-ASCII characters, because Python falls back to the default ASCII codec when writing. So the solution would be to encode explicitly, something like this
encoded_message = message.encode("utf-8")
file_handle.write(encoded_message)
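Putting the write side together, a complete runnable sketch could look like this (the file path is made up for illustration; the original used an already-open file_handle):

```python
# -*- coding: UTF-8 -*-
import os
import tempfile

message = u'unicode message: héllo'
path = os.path.join(tempfile.mkdtemp(), 'out.txt')  # hypothetical path

# Encode to UTF-8 bytes first, then write; 'wb' makes it explicit
# that bytes, not text, go into the file.
encoded_message = message.encode('utf-8')
file_handle = open(path, 'wb')
file_handle.write(encoded_message)
file_handle.close()
```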
But that’s only half of the problem. At some point you will be reading this data back, and most probably you would like to get your beloved unicode thingy back. Just doing the following will hardly help:
file_handle = open(full_name, 'r')
line = file_handle.readline()
The following will save your day:
file_handle = open(full_name, 'r')
line = file_handle.readline().decode('utf-8')
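As an alternative, the standard codecs module can hide the decode step for you (a sketch; the file is created here only so the example is self-contained):

```python
# -*- coding: UTF-8 -*-
import codecs
import os
import tempfile

# Hypothetical UTF-8 file, written first so there is something to read.
full_name = os.path.join(tempfile.mkdtemp(), 'data.txt')
with codecs.open(full_name, 'w', 'utf-8') as out:
    out.write(u'héllo\n')

# codecs.open decodes for you: readline() returns unicode directly,
# no manual .decode('utf-8') needed.
with codecs.open(full_name, 'r', 'utf-8') as file_handle:
    line = file_handle.readline()

assert line == u'héllo\n'
```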
Voilà. I hope this saves somebody some frustration, even if that somebody is me a few months later :).
Enjoy!
Comments from Andy (thanks!):
Probably there is nothing new for you in what I am saying below; however, in my experience it covers 99% of unicode-related errors.
A unicode string is a sequence of code points in the range 0 to 0x10FFFF. An encoding is a way of serializing this sequence so that it can be represented in memory, written to a file, sent over a socket, etc.
Encoding a unicode string is needed _at least_ because its ‘as is’ byte representation is not portable, due to byte order issues.
It is _recommended_ that you work with unicode strings internally, provided the language/API supports unicode.
It is a _must_ that you encode the string before it is consumed by another program.
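Andy’s byte order point is easy to demonstrate (a small sketch, using UTF-16 because its serialization depends on endianness):

```python
# The same unicode string, serialized with two explicit byte orders:
# the code points are identical, but the bytes differ.
text = u'hi'
little = text.encode('utf-16-le')  # b'h\x00i\x00'
big = text.encode('utf-16-be')     # b'\x00h\x00i'
assert little != big

# Both byte sequences decode back to the same unicode string.
assert little.decode('utf-16-le') == text
assert big.decode('utf-16-be') == text

# And the code point range mentioned above: the maximum is U+10FFFF.
assert ord(u'\U0010FFFF') == 0x10FFFF
```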