Tuesday, February 16, 2010

unicode and python

this is a nice reference for dealing with unicode in python; it explains things at just the right level.
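one of the confusing things about working with unicode in python 2 (as i have rediscovered once again) is that repr(), which is what the interactive prompt calls to display the value of an expression, always returns plain ascii: any non-ascii character in a unicode object is shown as an escape such as \u2019, regardless of locale settings. print, on the other hand, encodes the string with the terminal's encoding, so the actual characters show up if the terminal can display them. also, the default encoding used for implicit str-to-unicode conversion is ascii, so unicode() without an explicit codec fails on any byte above 0x7f. (an encoding is a particular binary representation of a character set, and is therefore distinct from the character set itself.)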
In [426]: s = unicode('here\xe2\x80\x99s an apostrophe')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
/home/tippetts/Ubuntu One/crunchSvn/svn/python/optimization/ in ()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
In [427]: s = unicode('here\xe2\x80\x99s an apostrophe','utf-8')
In [428]: s
Out[428]: u'here\u2019s an apostrophe'
In [429]: print s
--------> print(s)
here’s an apostrophe
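
the same behaviour can be sketched outside the interactive prompt. this is only a minimal illustration, not anything from the reference above: the names raw and try_decode are made up for the example, and it uses bytes.decode() instead of unicode() so that it should run unchanged on both python 2 and python 3. the 'unicode_escape' codec stands in for what repr() shows in python 2.

```python
# UTF-8 bytes for the example string; \xe2\x80\x99 encodes U+2019
# (RIGHT SINGLE QUOTATION MARK).
raw = b'here\xe2\x80\x99s an apostrophe'


def try_decode(data, codec):
    """Hypothetical helper: return (text, None) on success,
    or (None, error message) if the bytes are invalid for the codec."""
    try:
        return data.decode(codec), None
    except UnicodeDecodeError as err:
        return None, str(err)


# Decoding as ASCII fails on the first byte above 0x7f (position 4).
ascii_text, ascii_error = try_decode(raw, 'ascii')

# Decoding with the codec the bytes were actually written in succeeds.
utf8_text, utf8_error = try_decode(raw, 'utf-8')

# An ASCII-only escaped form, similar to what repr() displays in Python 2.
escaped = utf8_text.encode('unicode_escape')
```

here ascii_error should carry the same "can't decode byte 0xe2 in position 4" message as the traceback above, and utf8_text is 20 characters long even though raw is 22 bytes, because the three-byte sequence collapses into one character.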
