Python UnicodeEncodeError: 'ascii' codec can't encode character
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa1' in position 0: ordinal not in range(128)
If you've ever gotten this error, Django's smart_str
function might be able to help. I found this from James Bennett's
article,
Unicode in the real world. He provides a very good explanation
of Python's Unicode and bytestrings, their use in Django, and using
Django's Unicode utilities for working with non-Unicode-friendly
Python libraries. Here are my notes from his article as it applies
to the above error. Much of the wording is directly from James
Bennett's article.
This error occurs when you pass a Unicode string containing non-ASCII characters (code points above 127) to something that expects an ASCII bytestring. In Python 2, a Unicode string is implicitly converted to a bytestring with the ASCII codec, "which handles exactly 128 (English) characters". This is why converting any character with a code point of 128 or higher produces the error.
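The failure can be reproduced without any implicit conversion by calling encode directly. This is a minimal sketch; the variable names are illustrative only, and it is written to run on both Python 2 and Python 3.

```python
text = u'\xa1'  # one character, code point 161, outside ASCII's 0-127 range

try:
    text.encode('ascii')   # the same conversion Python 2 attempts implicitly in str(text)
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc      # "'ascii' codec can't encode character ..."

utf8_bytes = text.encode('utf-8')  # succeeds: UTF-8 can encode this character
print(repr(utf8_bytes))
```

The exception object carries the offending position in its start attribute, which matches the "position 0" in the message above.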
The good news is that you can encode Unicode strings in encodings
other than ASCII. Django's smart_str function, in the
django.utils.encoding module, converts a Unicode string
to a bytestring using a default encoding of UTF-8.
Here is an example using the built-in function, str:
a = u'\xa1'
print str(a) # this throws an exception
Results:
Traceback (most recent call last):
  File "unicode_ex.py", line 3, in <module>
    print str(a) # this throws an exception
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa1' in position 0: ordinal not in range(128)
Here is an example using smart_str:
from django.utils.encoding import smart_str
a = u'\xa1'
print smart_str(a)
Results:
¡
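For readers without Django at hand, the core of that conversion can be approximated in a few lines. This is only a sketch: to_bytestring is a hypothetical name, and Django's real smart_str contains additional logic that is not reproduced here. It is written to run on both Python 2 and Python 3.

```python
text_type = type(u'')  # unicode on Python 2, str on Python 3


def to_bytestring(value, encoding='utf-8'):
    """Hypothetical helper: return value as a bytestring in the given encoding."""
    if isinstance(value, bytes):
        return value                  # already bytes, leave untouched
    if not isinstance(value, text_type):
        value = text_type(value)      # e.g. numbers become their text form first
    return value.encode(encoding)


encoded = to_bytestring(u'\xa1')
print(repr(encoded))
```

The key design choice is that the encoding is named explicitly, so the result never depends on the interpreter's default.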
Definitions
- Unicode string: sequence of Unicode characters
- Python bytestring: a series of bytes which represent a sequence of characters. Its default encoding is ASCII. This is the "normal", non-Unicode string in Python <3.0.
- encoding: a code that pairs a sequence of characters with a series of bytes
- ASCII: an encoding which handles 128 English characters
- UTF-8: a popular encoding used for Unicode strings which is backwards compatible with ASCII for the first 128 characters. It uses one to four bytes for each character.
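The last two definitions can be checked directly. The sketch below encodes four sample characters (chosen here for illustration, they are not from the original article) and counts the bytes UTF-8 needs for each.

```python
samples = [
    u'A',           # U+0041, within ASCII
    u'\xa1',        # U+00A1, inverted exclamation mark
    u'\u9ebb',      # U+9EBB, a CJK character
    u'\U0001F600',  # U+1F600, outside the Basic Multilingual Plane
]

byte_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]

# For the first 128 characters, UTF-8 and ASCII produce identical bytes.
ascii_compatible = u'A'.encode('utf-8') == u'A'.encode('ascii')
```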
Comments
#2 Eliot commented on 2008-11-21:
Arthur, thanks for the tip. I'm not sure what differences the Django utility functions have. I will have to look into this further. For other readers, here is the documentation for encode: http://docs.python.org/library/stdtypes.html#str.encode
#3 enoola commented on 2009-02-07:
Hi mates, I wanted to print a string containing Chinese characters and found this simple and useful: http://members.shaw.ca/akochoi-old/blog/2005/10-02/index.html
# Simple unicode string
y = unicode(' 麻 婆 豆 腐', 'utf-8')
# Problem with this simple call..
# print y
# UnicodeEncodeError: 'ascii' codec can't encode character u'\u9ebb' in position 1: ordinal not in range(128)
# Solution
print y.encode('utf8')
#4 Lukas Monk commented on 2009-02-27:
I use this in my main unit :
import sys
reload(sys)  # Python 2 only: restores setdefaultencoding, which site.py deletes at startup
sys.setdefaultencoding( "latin-1" )
a = u'\xa1'
print str(a) # no exception
#5 Low Kian Seong commented on 2009-03-19:
Thanks for this man. Now my edit page does not bomb out anymore.
#6 Barbara commented on 2009-05-27:
Thanks so much for this - it's exactly what I needed, at exactly the right moment. :)
#9 William commented on 2009-11-10:
Python for Windows does not have the attribute setdefaultencoding. What can I do to display UTF-8 characters in text mode?
#10 Eliot commented on 2009-11-10:
William,
It seems like sys.setdefaultencoding is not designed for us to use. From the sys module documentation:
This function is only intended to be used by the site module implementation and, where needed, by sitecustomize. Once used by the site module, it is removed from the sys module’s namespace.
If you're not using Django, using the encode string method (described by Arthur and enoola above) seems good. From the Django source code for smart_str, it looks like smart_str uses encode with some other logic that you may or may not need.
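One alternative to changing the interpreter-wide default is to encode at the point of output. This is a minimal sketch using the standard io module; the BytesIO buffer is only a stand-in for a real file or stream opened in binary mode, and the names are illustrative.

```python
import io

raw = io.BytesIO()                                # stand-in for a binary file or stream
writer = io.TextIOWrapper(raw, encoding='utf-8')  # encodes text as it is written

writer.write(u'\xa1Hola')  # accepts Unicode text; no per-string encode() call needed
writer.flush()

written = raw.getvalue()
print(repr(written))
```

Wrapping the stream once keeps the encoding decision in a single place, which addresses the wish to avoid calling encode on every string.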
#14 chyro commented on 2010-02-01:
Same here, I found Lukas Monk's tip most useful as I don't want to use the "encode" function on every single string. I'm very puzzled about the "reload(sys)" part though. Why is the "setdefaultencoding" function not present until the module is reloaded? How come it makes any difference?
#15 chyro commented on 2010-02-01:
Sorry about the double post, it seems I'll answer my own question. I found more information on that function here: http://blog.ianbicking.org/illusive-setdefaultencoding.html It mostly raises the same question I did (in more details obviously). The real answers come in the second comment: http://blog.ianbicking.org/illusive-setdefaultencoding-comment-2.html That would kind of explain it. I'd still prefer everything being UTF-8.
#16 Eliot commented on 2010-02-02:
chyro, Thanks for adding this information. Also, I changed your plain-text URLs into clickable links.
#18 PhilGo20 commented on 2010-04-02:
You saved me some time ..again. Thanks
I would add that if one is using DOM to output to a file (dom.toxml() or dom.toprettyxml()), make sure to add the encoding="utf-8" parameter or you will also generate the same type of error:
UnicodeEncodeError: 'ascii' codec can't encode character
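A small illustration of that tip, assuming the DOM in question is the standard library's xml.dom.minidom (the comment does not say which DOM implementation was used). The element name and text are made up for the example.

```python
from xml.dom import minidom

doc = minidom.Document()
root = doc.createElement('greeting')
root.appendChild(doc.createTextNode(u'\xa1Hola!'))
doc.appendChild(root)

# With an explicit encoding, toxml returns an encoded bytestring that can be
# written to a file opened in binary mode.
xml_bytes = doc.toxml(encoding='utf-8')
print(repr(xml_bytes))
```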
#19 Loe Spee commented on 2010-05-05:
If you get this error when serializing data to JSON, it might be caused by the "ensure_ascii=False" option. Leaving this option out prevents the error from happening.
This will cause the error:
serializers.serialize('json', [data], ensure_ascii=False)
This will prevent the error:
serializers.serialize('json', [data])
More info at:
http://groups.google.com/group/django-users/browse_thread/thread/4f5f99b730ee0aae/
http://groups.google.com/group/django-users/browse_thread/thread/87b1478c02d743e0/
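The effect of ensure_ascii can be seen with the standard library's json module, used here as a stand-in for Django's serializers (an assumption made for illustration; the Django call above is not reproduced). The sample data is made up.

```python
import json

data = {'greeting': u'\xa1Hola!'}

# Default (ensure_ascii=True): non-ASCII characters are written as \uXXXX
# escapes, so the result contains only ASCII characters.
escaped = json.dumps(data)
escaped_bytes = escaped.encode('ascii')

# ensure_ascii=False keeps the original characters, so the result must be
# encoded explicitly before it is treated as bytes.
raw = json.dumps(data, ensure_ascii=False)
raw_bytes = raw.encode('utf-8')

print(escaped)
```

If ensure_ascii=False is needed, encoding the result explicitly as UTF-8 avoids falling back to the ASCII codec.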
About
I'm Eliot and this is my notepad for programming topics such as Python, Django, Ubuntu, Emacs, etc.
#1 Arthur Buliva commented on 2008-11-20:
A simpler way to do this is:
print unicode(u'\xa1').encode("utf-8")