Why does Python sometimes upgrade a string to unicode and sometimes not?

I'm confused. Consider this code, which works as I expect:

>>> foo = u'Émilie and Juañ are turncoats.'
>>> bar = "foo is %s" % foo
>>> bar
u'foo is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'

      

And this code does not work at all as I expect:

>>> try:
...     raise Exception(foo)
... except Exception as e:
...     foo2 = e
... 
>>> bar = "foo2 is %s" % foo2
------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

      

Can someone please explain what's going on here? Why does it matter whether the unicode data is in a plain unicode string or stored in an Exception object? And why does this fix it:

>>> bar = u"foo2 is %s" % foo2
>>> bar
u'foo2 is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'

      

I am very confused! Thanks for the help!

UPDATE: My coding buddy Randall added to my confusion by trying to help me! Send in reinforcements to explain how this should make sense:

>>> class A:
...     def __str__(self): return "string"
...     def __unicode__(self): return "unicode"
... 
>>> "%s %s" % (u'niño', A())
u'ni\xc3\xb1o unicode'
>>> "%s %s" % (A(), u'niño')
u'string ni\xc3\xb1o'

      

Note that the order of the arguments here determines which method is called!

1 answer


The Python Language Reference has an answer:

"If format is a Unicode object, or if any of the objects being converted using the %s conversion are Unicode objects, the result will also be a Unicode object."

foo = u'Émilie and Juañ are turncoats.'
bar = "foo is %s" % foo

      

This works because foo is a unicode object. That triggers the rule above, so the result is a Unicode string.

bar = "foo2 is %s" % foo2

      

In this case, foo2 is an Exception object, which is obviously not a unicode object. So the interpreter tries to convert it to a normal str using the default encoding. Apparently that encoding is ascii, which cannot represent these characters, so the conversion raises an exception.
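
The failing step can be reproduced in isolation. This is a minimal sketch, assuming the implicit conversion behaves like an explicit .encode('ascii') call; the helper name try_ascii is made up for illustration, and the accented text is written with escapes so the snippet does not depend on the source file encoding:

```python
def try_ascii(text):
    """Encode text with the ascii codec; return the bytes or the failure reason."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError as exc:
        return 'failed: %s' % exc.reason

plain = u'turncoats'
accented = u'\xc9milie and Jua\xf1'   # same accented names as in the question

print(try_ascii(plain))      # pure ASCII text encodes without trouble
print(try_ascii(accented))   # reports "ordinal not in range(128)"
```

Only the second call fails, because the ascii codec has no representation for code points above 127.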

bar = u"foo2 is %s" % foo2

      

Here it works again because the format string is a unicode object. So the interpreter tries to convert foo2 to a unicode object, which succeeds.




As for Randall's question: that surprises me too. However, this follows the standard (reformatted for readability):

"%s converts any Python object using str(). If the object or format provided is a unicode string, the resulting string will also be unicode."

How such a unicode object is created remains unspecified. So both are legal:

  • call __str__, decode the result back to a Unicode string, and insert it into the output string
  • call __unicode__ and insert the result directly into the output string
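
One way to read Randall's result is that the arguments are processed left to right in byte-string mode, and that the formatter switches to Unicode mode only when it meets the first unicode argument, so everything before the switch has already gone through __str__. That reading is an assumption consistent with the two outputs in the question, not something the quoted documentation states. Below is a toy model of it, written for Python 3, where __unicode__ no longer exists: the to_unicode method, the U wrapper, and format_like_py2 are all hypothetical names introduced for illustration.

```python
class A(object):
    def __str__(self):
        return "string"

    def to_unicode(self):
        # Hypothetical stand-in for Python 2's __unicode__.
        return "unicode"


class U(object):
    """Marks a value as a 'unicode' argument in this toy model."""
    def __init__(self, text):
        self.text = text


def format_like_py2(args):
    """Join args with spaces, switching to 'unicode mode' at the first U."""
    unicode_mode = False
    parts = []
    for arg in args:
        if isinstance(arg, U):
            unicode_mode = True        # from here on, prefer to_unicode()
            parts.append(arg.text)
        elif unicode_mode and hasattr(arg, "to_unicode"):
            parts.append(arg.to_unicode())
        else:
            parts.append(str(arg))     # still in byte-string mode: plain str()
    return " ".join(parts)


first = format_like_py2([U(u"ni\xf1o"), A()])    # A() comes after the switch
second = format_like_py2([A(), U(u"ni\xf1o")])   # A() comes before the switch
print(first.endswith("unicode"), second.startswith("string"))
```

In this model the same object contributes "unicode" or "string" depending only on its position relative to the first unicode argument, which matches the order dependence Randall observed.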

The mixed behavior of the Python interpreter is pretty disgusting. I would consider this to be a mistake in the standard.

Edit: Quoting the Python 3.0 changelog, emphasis mine:

Everything you knew about binary data and Unicode has changed.

[...]

  • As a consequence of this change in philosophy, pretty much all code that uses Unicode, encodings or binary data most likely has to change. The change is for the better, as in the 2.x world there were numerous bugs having to do with mixing encoded and unencoded text.
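
To see what that change means in practice, here is a minimal sketch for Python 3, where bytes and str are never mixed implicitly. The helper name mix is made up for illustration:

```python
def mix(encoded, text):
    """Try to concatenate bytes and str the way Python 2 silently allowed."""
    try:
        return encoded + text
    except TypeError as exc:
        return "TypeError: %s" % exc

text = u"Jua\xf1"
encoded = text.encode("utf-8")     # explicit conversion: str -> bytes

print(mix(b"foo is ", text))       # Python 3 refuses and reports a TypeError
# Mixing works only after an explicit encode or decode on one side.
print((b"foo is " + encoded).decode("utf-8") == u"foo is " + text)
```

Instead of guessing an encoding as Python 2 did, Python 3 makes the caller choose one explicitly, so the failure shows up at the point where the two types meet.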