Why does Python sometimes upgrade a string to unicode and sometimes not?

Question

I'm confused. Consider how this code works as I expect:

>>> foo = u'Émilie and Juañ are turncoats.'
>>> bar = "foo is %s" % foo
>>> bar
u'foo is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'

And this code does not work at all as I expect:

>>> try:
...     raise Exception(foo)
... except Exception as e:
...     foo2 = e
... 
>>> bar = "foo2 is %s" % foo2
------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Can someone please explain what's going on here? Why does it matter whether the unicode data is in a plain unicode string or stored in an Exception object? And why does this fix it:

>>> bar = u"foo2 is %s" % foo2
>>> bar
u'foo2 is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'

I am very confused! Thanks for the help!

UPDATE: My coding buddy Randall added to my confusion by trying to help me! Send in reinforcements to explain how this should make sense:

>>> class A:
...     def __str__(self): return "string"
...     def __unicode__(self): return "unicode"
... 
>>> "%s %s" % (u'niño', A())
u'ni\xc3\xb1o unicode'
>>> "%s %s" % (A(), u'niño')
u'string ni\xc3\xb1o'

Note that the order of the arguments here determines which method is called!

python unicode

samtregar May 19 '10 at 17:10

1 answer

Thomas · Accepted Answer · 2010-05-19T17:15:44+0000

The Python Language Reference has an answer:

If format is a Unicode object, or if any of the objects being converted using the %s conversion are Unicode objects, the result will also be a Unicode object.

foo = u'Émilie and Juañ are turncoats.'
bar = "foo is %s" % foo

This works because foo is a unicode object. That causes the rule above to take effect, and the result is a Unicode string.

bar = "foo2 is %s" % foo2

In this case, foo2 is an Exception object, which is obviously not a unicode object. So the interpreter tries to convert it to a normal str using the default encoding. That encoding is apparently ascii, which cannot represent these characters, so an exception is raised.
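
The failing step can be reproduced in isolation. The sketch below is written for Python 3 rather than the Python 2 session above, so the explicit encode('ascii') call is only a stand-in for the implicit conversion Python 2 attempts; the variable names are illustrative.

```python
# Python 3 sketch: the explicit .encode('ascii') stands in for the implicit
# conversion Python 2 attempts when it needs a byte string from text.
text = 'Émilie and Juañ are turncoats.'

try:
    text.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    # 'É' lies outside the 128 code points that ASCII can represent.
    error = exc

# An encoding that covers these characters succeeds.
encoded = text.encode('utf-8')
```

The error is raised at the first non-ASCII character, which is why the whole conversion fails rather than producing a partial result.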

bar = u"foo2 is %s" % foo2

Here it works again because the format string is a unicode object. So the interpreter tries to convert foo2 to a unicode object, which succeeds.

As for Randall's question: that surprises me too. However, this follows the standard (reformatted for readability):

%s converts any Python object using str(). If the object or format provided is a unicode string, the resulting string will also be unicode.

How such a unicode object is created is not specified. So both of the following are legal:

call __str__, decode the result back to a Unicode string, and insert it into the output string
call __unicode__ and insert the result directly into the output string
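
One way to picture how both strategies can show up in a single expression is a toy model in which formatting proceeds left to right and switches to Unicode at the first unicode argument. The sketch below is written for Python 3 and is only an illustration of the order dependence seen in the question, not the interpreter's actual implementation. Here bytes plays the role of Python 2's str and str plays the role of unicode; py2_str, py2_unicode, to_bytes, to_text, and py2_format are hypothetical names, and only %s placeholders are handled.

```python
class A:
    def py2_str(self):      # hypothetical stand-in for __str__ (byte string)
        return b"string"

    def py2_unicode(self):  # hypothetical stand-in for __unicode__ (text)
        return "unicode"


def to_bytes(arg):
    """Byte-string conversion, used before any text argument has been seen."""
    return arg if isinstance(arg, bytes) else arg.py2_str()


def to_text(arg):
    """Text conversion, used once the result has been promoted to text."""
    if isinstance(arg, str):
        return arg
    if isinstance(arg, bytes):
        return arg.decode("ascii")
    return arg.py2_unicode()


def py2_format(fmt, args):
    """Toy model of `fmt % args` for a byte-string fmt with only %s placeholders."""
    pieces = fmt.split(b"%s")   # literal text before/after each placeholder
    out = pieces[0]
    unicode_mode = False        # becomes True at the first text argument
    for literal, arg in zip(pieces[1:], args):
        if not unicode_mode and isinstance(arg, str):
            # Promote everything built so far to text.
            out = out.decode("ascii")
            unicode_mode = True
        if unicode_mode:
            out += to_text(arg) + literal.decode("ascii")
        else:
            out += to_bytes(arg) + literal
    return out


first = py2_format(b"%s %s", ("niño", A()))   # text argument comes first
second = py2_format(b"%s %s", (A(), "niño"))  # object is converted first
```

In this model, first uses the text conversion for A() because the switch has already happened, while second converts A() as a byte string before the switch, which mirrors the two results Randall observed.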

The mixed behavior of the Python interpreter is pretty disgusting. I would consider this to be a mistake in the standard.

Edit: Quoting the Python 3.0 changelog, emphasis mine:

Everything you thought you knew about binary data and Unicode has changed.

[...]

As a consequence of this change in philosophy, pretty much all code that uses Unicode, encodings or binary data most likely has to change. The change is for the better, as in the 2.x world there were numerous bugs having to do with mixing encoded and unencoded text.
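
For comparison, here is a minimal sketch of how the examples from the question behave when rewritten for Python 3, where every str is Unicode text. The u prefixes are dropped and the names bar, first, second, and mixed_error are illustrative.

```python
foo = 'Émilie and Juañ are turncoats.'

try:
    raise Exception(foo)
except Exception as e:
    foo2 = e

# str is always Unicode text, so formatting the exception just works.
bar = "foo2 is %s" % foo2


class A:
    def __str__(self):
        return "string"

    def __unicode__(self):  # not consulted by %s formatting in Python 3
        return "unicode"


# Argument order no longer changes which method is called.
first = "%s %s" % ('niño', A())
second = "%s %s" % (A(), 'niño')

# Mixing bytes and text fails with a TypeError instead of converting implicitly.
try:
    b"foo is " + foo
    mixed_error = None
except TypeError as exc:
    mixed_error = exc
```

Both orderings call __str__, and the only way to combine bytes with text is an explicit encode or decode.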
