SaltyCrane Blog — Notes on JavaScript and web development

The old "%" string formatting and the new string .format() method handle unicode differently

Today I learned that the old style "%" string formatting and the new string .format() method behave differently when interpolating unicode strings. I was surprised to find out that one of these lines raised an error while the other did not:

'%s' % u'O\u2019Connor'
'{}'.format(u'O\u2019Connor')

The old style "%" formatting operation returns a unicode string if one of the values is a unicode string even when the format string is a non-unicode string:

Python 2.7.3 (default, Feb 27 2014, 19:58:35) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> '%s' % u'O\u2019Connor'
u'O\u2019Connor'

The new string .format() method, called on a non-unicode string with a unicode string argument, tries to encode the unicode string to a non-unicode string (bytestring) using the default 'ascii' codec, possibly raising a UnicodeEncodeError:

Python 2.7.3 (default, Feb 27 2014, 19:58:35) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> '{}'.format(u'O\u2019Connor')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 1: ordinal not in range(128)
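
The failure seems to come down to an implicit encode with the default codec. Here is a minimal sketch that reproduces the same error with an explicit encode call, assuming the default encoding is ASCII as the traceback suggests. The variable names are mine and this is only an approximation of what .format() does internally:

```python
name = u'O\u2019Connor'

try:
    # Spell out the encode that '{}'.format(name) appears to attempt implicitly
    name.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The exception records which codec failed and where in the string
print(error.encoding, error.start, repr(error.object[error.start:error.end]))
```

The offending character is the right single quotation mark (U+2019) at index 1, which matches "position 1" in the traceback.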

I guess the correct thing to do is to start with a unicode format string:

Python 2.7.3 (default, Feb 27 2014, 19:58:35) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'{}'.format(u'O\u2019Connor')
u'O\u2019Connor'
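
If a bytestring is needed in the end (for a file or a socket, say), one approach is to do all the formatting in unicode and encode once, explicitly, at the boundary. A rough sketch follows. The format_name helper is hypothetical, and choosing UTF-8 is my assumption, not something the formatting methods require:

```python
def format_name(name):
    """Hypothetical helper: format as unicode, then encode once at the boundary."""
    text = u'Hello, {}!'.format(name)  # unicode format string gives a unicode result
    return text.encode('utf-8')        # explicit encode; UTF-8 is an assumption

encoded = format_name(u'O\u2019Connor')
print(repr(encoded))
```

Since the codec is named explicitly, nothing depends on the interpreter's default encoding.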


Comments


#1 JeromeJ commented:

"Python 2".


Python 3 never tries to convert implicitly. The result is that the pain is more immediate, but it forces the programmer to understand what he is doing rather than try and fail with encode, then decode, until it finally works…
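
To illustrate the Python 3 point, here is a small sketch that assumes it runs under Python 3 (it will not behave the same under Python 2). The variable names are made up for the example:

```python
name = 'O\u2019Connor'          # str is unicode text in Python 3
formatted = '{}'.format(name)   # no implicit encode, so no UnicodeEncodeError

try:
    b'Hello, ' + name           # mixing bytes and str is refused outright
    mixed_ok = True
except TypeError:
    mixed_ok = False

encoded = formatted.encode('utf-8')  # the conversion has to be spelled out
print(formatted, mixed_ok, encoded)
```

Formatting text with text just works. Only the attempt to mix bytes and text fails, and it fails right away with a TypeError rather than depending on what characters the string happens to contain.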