How to show non-ASCII characters in Python?

I am using the Python shell like this:

>>> s = 'Ã'
>>> s
'\xc3'

      

How can I print the variable s so that it shows the character Ã? This is the first and easiest question. In fact, I am getting content from a web page that contains non-ASCII characters like the one above, as well as accented characters such as á, é, í, ñ, etc. I am also trying to run a regex that contains these characters, built from a template, against the content of the web page.

How to solve this problem?

This is an example of one regex:

u'<td[^>]*>\s*Definición\s*</td><td class="value"[^>]*>\s*(?P<data>[\w ,-:\.\(\)]+)\s*</td>'

      

If I test it in the Expresso app, the regex works fine.

EDIT [05/26/2009 04:38 PM]: Sorry about my explanations. I will try to explain better.

I need to get text from a page. I have the URL of this page and a regex to extract the text. My first thought was that the regex was wrong, but I tested it with Expresso and it worked fine: I got the text I wanted. Next, I printed the content of the page, and that was when I saw that it did not match what I see in the page source. The differences are in non-ASCII characters like á, é, í, etc. Now I don't know what to do, or whether the problem lies in the encoding of the page content or in the text of the regex pattern. The regex shown above is one of those I have defined.

So the question would be: is there a problem with using a regex whose pattern contains non-ASCII characters?

0




3 answers


Let's say you want to print it as UTF-8. Before Python 3, it's best to encode it explicitly:

print u'Ã'.encode('utf-8')

      



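If you get text from an outside source, you need to decode it explicitly (from UTF-8 in this example), like this:

f = open(my_file)  # my_file: path to a UTF-8 encoded text file
a = f.next().decode('utf-8')  # a is now a unicode object holding the first line
print a.encode('utf-8')  # encode again when writing it out
f.close()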
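
The same decode-first idea also covers the regex part of the question. Below is a minimal sketch written for Python 3 (an assumption; the thread itself uses Python 2), where the page bytes, the UTF-8 encoding, and the shortened pattern are all made up for illustration. The hyphen in the character class is moved to the end so it is read literally rather than as a range.

```python
import re

# Hypothetical page content as raw bytes, as it might arrive from the network.
raw = '<td>Definición</td><td class="value">Árbol pequeño</td>'.encode('utf-8')

# Decode once at the boundary. UTF-8 is an assumption; use the page's real encoding.
html = raw.decode('utf-8')

# Simplified version of the pattern from the question.
# On str patterns in Python 3, \w matches Unicode letters such as Á and ñ by default.
pattern = re.compile(
    r'<td[^>]*>\s*Definición\s*</td>'
    r'<td class="value"[^>]*>\s*(?P<data>[\w ,.:()-]+)\s*</td>'
)

match = pattern.search(html)
data = match.group('data') if match else None
```

Here `data` ends up holding the text of the second cell. In Python 2, the rough equivalent would be a `u'...'` pattern compiled with `re.UNICODE` and matched against a decoded `unicode` string.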

      

+2




How can I print the variable s to show the character Ã?

Use print:



>>> s = 'Ã'
>>> s
'\xc3'
>>> print s
Ã
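
The difference comes from the shell echoing an escaped representation while print writes the characters themselves. A rough Python 3 sketch of the same distinction follows (Python 3 is an assumption here; the session above is Python 2), using `ascii()` to reproduce the escaped form.

```python
s = 'Ã'

# ascii() escapes non-ASCII characters, much like the Python 2 shell echo above.
escaped = ascii(s)

# repr() in Python 3 leaves printable non-ASCII characters as they are.
shown = repr(s)

# The single byte \xc3 is what this character looks like when encoded as Latin-1,
# while UTF-8 needs two bytes for it.
latin1_bytes = s.encode('latin-1')
utf8_bytes = s.encode('utf-8')

print(escaped)  # prints '\xc3'
```

This suggests the original shell session was running in a Latin-1 terminal, although that is an inference and not something the question states.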

      

+2




I would use ord() to find out whether a character is ASCII or special:

if ord(c) > 127:
    # special (non-ASCII) character
    pass

      

This probably won't work on byte strings in a multibyte encoding like UTF-8. In that case, you have to decode to Unicode before testing.

If you are getting special characters from a web page, you must know its encoding. Then decode the content accordingly; see the Unicode HOWTO.
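
To illustrate why decoding has to come first, here is a small Python 3 sketch (Python 3, the sample word, and the helper name `non_ascii_chars` are assumptions made for this example). It compares the `ord(c) > 127` test on raw UTF-8 bytes with the same test on the decoded text.

```python
def non_ascii_chars(text):
    """Return the characters of a decoded string whose code point is above 127."""
    return [c for c in text if ord(c) > 127]

# Hypothetical input bytes; UTF-8 is an assumption about the source encoding.
raw = 'señal'.encode('utf-8')

# Testing the raw bytes flags two values for the single character ñ.
flagged_bytes = [b for b in raw if b > 127]

# Decoding first yields exactly one entry per non-ASCII character.
flagged_chars = non_ascii_chars(raw.decode('utf-8'))
```

With this input, the byte-level test reports two bytes while the character-level test reports the single ñ.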

Edit: I'm honestly not sure what this question is asking. It might be a good idea to clarify it.

+1








