Why does Python sometimes upgrade a string to unicode and sometimes not?
I'm confused. Consider this code, which works as I expect:
>>> foo = u'Émilie and Juañ are turncoats.'
>>> bar = "foo is %s" % foo
>>> bar
u'foo is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'
And this code does not work at all as I expect:
>>> try:
...     raise Exception(foo)
... except Exception as e:
...     foo2 = e
...
>>> bar = "foo2 is %s" % foo2
------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Can someone please explain what's going on here? Why does it matter whether the unicode data is in a plain unicode string or stored in an Exception object? And why does this fix it:
>>> bar = u"foo2 is %s" % foo2
>>> bar
u'foo2 is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'
I am very confused! Thanks for the help!
UPDATE: My coding buddy Randall added to my confusion by trying to help me! Send in reinforcements to explain how this should make sense:
>>> class A:
...     def __str__(self): return "string"
...     def __unicode__(self): return "unicode"
...
>>> "%s %s" % (u'niño', A())
u'ni\xc3\xb1o unicode'
>>> "%s %s" % (A(), u'niño')
u'string ni\xc3\xb1o'
Note that the order of the arguments here determines which method is called!
The Python Language Reference has an answer:
If format is a Unicode object, or if any of the objects being converted using the %s conversion are Unicode objects, the result will also be a Unicode object.
foo = u'Émilie and Juañ are turncoats.'
bar = "foo is %s" % foo
This works because foo is a unicode object. That triggers the rule above, so the result is a Unicode string.
bar = "foo2 is %s" % foo2
In this case, foo2 is an Exception object, which is obviously not a unicode object. So the interpreter tries to convert it to a normal str using the default encoding. That is apparently ascii, which cannot represent these characters, so the conversion throws an exception.
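The failure can be reproduced explicitly. The sketch below assumes, as the traceback suggests, that the implicit conversion amounts to encoding the text with the ascii codec. The helper name `to_ascii_str` is made up for illustration, and the accented characters are written as escapes so the snippet does not depend on the source file's encoding.

```python
# Sketch: mimic the implicit unicode -> str conversion by encoding
# explicitly with the 'ascii' codec (assumed to be the default encoding).

def to_ascii_str(text):
    """Hypothetical helper: encode text with the ascii codec."""
    return text.encode('ascii')

# Same sentence as in the question, with the accented letters escaped.
foo = u'\xc9milie and Jua\xf1 are turncoats.'

try:
    to_ascii_str(foo)
    failed = False
except UnicodeEncodeError as err:
    # Non-ASCII characters cannot be represented, so encoding fails.
    failed = True
    reason = err.reason

# Text that contains only ASCII characters encodes without trouble.
plain = to_ascii_str(u'no accents here')
```

The error only appears when the text contains characters outside the ASCII range, which is why the same formatting code can work for months and then break on the first accented name.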
bar = u"foo2 is %s" % foo2
Here it works again because the format string is a unicode object. So the interpreter tries to convert foo2 to a unicode object, which succeeds.
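If you would rather not depend on the type of the format string, one option is to convert the exception to text yourself before formatting. This is only a sketch: `text_type` is an alias invented here so the same lines run on both 2.x and 3.x, and it assumes the exception carries a single text argument as in the question.

```python
import sys

# Hypothetical alias: the text type is 'unicode' on 2.x and 'str' on 3.x.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821

foo = u'\xc9milie and Jua\xf1 are turncoats.'

try:
    raise Exception(foo)
except Exception as e:
    foo2 = e

# Convert explicitly, then format with a unicode format string.
message = text_type(foo2)
bar = u"foo2 is %s" % message
```

With the conversion spelled out, no implicit ascii encoding step is left for the interpreter to attempt.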
As for Randall's question: that surprises me too. However, this follows the standard (reformatted for readability):
%s converts any Python object using str(). If the object or format provided is a unicode string, the resulting string will also be unicode.
How such a unicode object is created remains unclear. So both are legal:
- call __str__, decode the result back to a Unicode string, and insert it into the output string
- call __unicode__ and insert the result directly into the output string
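One way to picture why the argument order matters is the simplified model below. It assumes the interpreter walks the arguments left to right and only switches to Unicode once it meets a unicode argument; that is an assumption about the implementation, not something the quoted documentation states. Byte strings stand in for 2.x `str`, and the names `to_bytes`, `to_text`, and `format_all` are invented for the sketch.

```python
# Simplified model of "%s" formatting with mixed argument types.
# Byte strings stand in for Python 2's str, text strings for unicode.

class A(object):
    def to_bytes(self):   # stands in for __str__
        return b"string"

    def to_text(self):    # stands in for __unicode__
        return u"unicode"


def format_all(args):
    """Join args with spaces, switching to text mode at the first text argument."""
    parts = []
    text_mode = False
    for arg in args:
        if isinstance(arg, type(u"")):
            if not text_mode:
                # Decode everything produced so far, then continue in text mode.
                parts = [p.decode("ascii") for p in parts]
                text_mode = True
            parts.append(arg)
        elif text_mode:
            # Already producing text: use the object's text conversion.
            parts.append(arg.to_text())
        else:
            # Still producing bytes: use the object's byte-string conversion.
            parts.append(arg.to_bytes())
    sep = u" " if text_mode else b" "
    return sep.join(parts)


first = format_all([u'ni\xf1o', A()])
second = format_all([A(), u'ni\xf1o'])
```

In this model `first` ends in "unicode" and `second` starts with "string", matching what Randall observed: an object formatted before the first unicode argument goes through the byte-string path, and one formatted after it goes through the text path.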
The mixed behavior of the Python interpreter is pretty disgusting. I would consider this to be a mistake in the standard.
Edit: Quoting the Python 3.0 changelog, emphasis mine:
Everything you knew about binary data and Unicode has changed.
[...]
- As a consequence of this change in philosophy, almost all code that uses Unicode, encodings, or binary data is likely to change. The change is for the better, as the 2.x world had a lot of bugs related to mixing encoded and non-encoded text.
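To make the difference concrete, here is a small sketch that assumes it is run under Python 3 (it will not behave the same way on 2.x). Text is Unicode by default there, so the formatting from the question just works, while mixing bytes and text is rejected outright instead of being converted silently.

```python
# Assumes Python 3: str is text, bytes is encoded data.
foo = '\xc9milie and Jua\xf1 are turncoats.'

try:
    raise Exception(foo)
except Exception as e:
    foo2 = e

# No unicode/str split any more, so this formats without an encoding step.
bar = "foo2 is %s" % foo2

# Mixing encoded bytes with text is an explicit error, not a silent conversion.
try:
    b"foo is " + foo
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The mistake is still possible, but it now fails immediately at the line that mixes the two types rather than only when non-ASCII data happens to show up.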