How to show non-ascii characters in python?
I am using Python shell this way:
>>> s = 'Ã'
>>> s
'\xc3'
How can I print the variable s so that it shows the character Ã? This is the first and easiest question. In fact, I am getting content from a web page that contains non-ASCII characters like the one above, and also accented ones such as á, é, í, ñ, etc. I am also trying to run a regex whose pattern contains these characters against the content of a web page.
How can I solve this problem?
This is an example of one regex:
u'<td[^>]*>\s*Definición\s*</td><td class="value"[^>]*>\s*(?P<data>[\w ,-:\.\(\)]+)\s*</td>'
If I use the Expresso app, it works great.
EDIT [05/26/2009 04:38 PM]: Sorry about my explanations. I will try to explain better.
I need to get text from a page. I have the URL of this page and a regex to extract the text. The first thing I thought was that the regex was wrong, so I tested it with Expresso: it worked fine and I got the text I wanted. Then I thought I should print the content of the page, and that was when I saw that the content was not what I see in the original source of the web page. The differences are in non-ASCII characters like á, é, í, etc. Now I don't know what to do, or whether the problem is in the encoding of the page content or in the text of the regex pattern. One of the regexes I have defined is the one shown above.
So the question would be: is there a problem with using a regex that has non-ASCII characters in the pattern?
Let's say you want to print it as UTF-8. Before Python 3, it's best to encode it explicitly:
print u'Ã'.encode('utf-8')
If you get text from an outside source, you need to decode it explicitly (for example as 'utf-8'), like this:
f = open(my_file)
a = f.next().decode('utf-8') # you have a unicode line in a
print a.encode('utf-8')
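The same decode-on-input, encode-on-output idea can be sketched without a file. This is only an illustration: the byte string `raw` and the helper name `to_text` are made up here, and the `b''` / `u''` literals are used so that it should run on both Python 2.6+ and Python 3.

```python
def to_text(raw, encoding='utf-8'):
    """Hypothetical helper: decode bytes from outside into Unicode text."""
    return raw.decode(encoding)

raw = b'\xc3\x83'                 # UTF-8 bytes for U+00C3 (A with tilde)
text = to_text(raw)               # Unicode text, a single character
encoded = text.encode('utf-8')    # back to bytes, ready to be written out
```

Inside the program you work with `text`; bytes only appear at the edges, when reading input or writing output.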
I would use ord() to find out if a character is ASCII or special:
if ord(c) > 127:
    pass  # special (non-ASCII) character: handle it here
This probably won't work on byte strings in a multibyte encoding like UTF-8, where a single character can span several bytes. In that case, you have to convert to Unicode before testing.
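A small sketch of that difference, assuming the input is UTF-8. The function name `has_non_ascii` and the sample bytes are just for illustration:

```python
def has_non_ascii(text):
    """Return True if any character of a Unicode string is outside ASCII."""
    return any(ord(c) > 127 for c in text)

raw = b'ma\xc3\xb1ana'        # UTF-8 bytes: the n-with-tilde takes two bytes
text = raw.decode('utf-8')    # decoded: one item per character

# Collect the characters that fall outside the ASCII range.
non_ascii = [c for c in text if ord(c) > 127]
```

Here `raw` has 7 bytes but `text` has 6 characters, which is why the test is only meaningful after decoding.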
If you are getting special characters from a web page, you must know its encoding. Then decode it; see the Unicode HOWTO.
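For the regex part, a minimal sketch, assuming the page is served as UTF-8 (check the Content-Type header or the page's meta tag; if it is really Latin-1, change the codec). The HTML snippet and the simplified pattern below are made up for illustration. The point is that both the pattern and the text are Unicode, and that re.UNICODE makes \w match accented letters on Python 2 (on Python 3 that is already the default for str patterns).

```python
import re

# Made-up page content, as bytes, the way a download would return it.
page_bytes = (b'<td>Definici\xc3\xb3n</td>'
              b'<td class="value">caf\xc3\xa9, t\xc3\xa9</td>')
page = page_bytes.decode('utf-8')   # assumption: the page is UTF-8

# Unicode pattern with a non-ASCII character (\xf3 is o with acute accent).
pattern = re.compile(
    u'<td[^>]*>\\s*Definici\xf3n\\s*</td>'
    u'<td class="value"[^>]*>\\s*(?P<data>[\\w ,.-]+)\\s*</td>',
    re.UNICODE)

match = pattern.search(page)
data = match.group('data') if match else None
```

If the page is matched as raw bytes instead, the two-byte sequence for the accented letter will not equal the single character in a Unicode pattern, and the search fails even though the regex itself is fine.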
Edit: I'm really not sure what this question is about... It might be a good idea to clarify it.