Reading raw Unicode strings in Python

Question

I'm new to Python, so my question might be stupid, but even after reading many threads I haven't found an answer.

I have a mixed source document which contains HTML, XML, LaTeX and other text formats, and which I am trying to convert to LaTeX only.

So I used Python to recognize the various commands with regular expressions and replace them with the corresponding LaTeX command. Everything has worked well so far.

Now I'm left with some "raw-type" Unicode characters such as Greek letters. Unfortunately, I'm almost ready to replace them manually, so I am looking for a smarter way. Is there a way for Python to recognize / read them? And how do I tell Python to recognize / read, e.g., a pi written as a Greek letter?

The minimal example of code used is:

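import re

# Read the whole source document into one string
fh = open('SOURCE_DOCUMENT','r')
stuff = fh.read()
fh.close()

# Replace each match of the pattern with the LaTeX equivalent
new_stuff = re.sub('READ','REPLACE',stuff)
fh = open('LATEX_DOCUMENT','w')
fh.write(new_stuff)
fh.close()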

I'm not sure if this is important information or not, but I am using Python 2.6 running on Windows.

I would be very happy if someone could give me a hint, at least where to find the relevant information or how it might work. Or am I completely wrong and Python cannot do the job ...

Thank you very much in advance.
Cheers,
Britt

python string unicode readability

Britta May 26 '09 at 9:54

3 answers

Stephan202 · Answer 1 · 2009-05-26T10:09:09+0000

You talk about "raw" Unicode strings. What do you mean by that? Unicode itself is not an encoding, but there are different encodings for storing Unicode characters (read this post from Joel).

The open function in Python 3.0 takes an optional argument encoding that allows you to specify an encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, take a look at the codecs module, which also provides an open function that allows you to specify the file encoding.
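A minimal sketch of reading and writing with an explicit encoding, assuming the file is UTF-8 (that is a guess, not something the question states). The helper names read_text and write_text and the demo file are hypothetical; codecs.open itself is available in both Python 2.6 and Python 3.

```python
import codecs
import os
import tempfile


def read_text(path, encoding="utf-8"):
    # codecs.open decodes the bytes on disk into Unicode text while reading.
    with codecs.open(path, "r", encoding=encoding) as fh:
        return fh.read()


def write_text(path, text, encoding="utf-8"):
    # The same encoding argument is used to encode the text when writing.
    with codecs.open(path, "w", encoding=encoding) as fh:
        fh.write(text)


# Round trip through a temporary file (hypothetical demo data).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
write_text(demo_path, u"area = \u03c0 r^2")
content = read_text(demo_path)
```

If the assumed encoding is wrong, the read will typically fail with a UnicodeDecodeError or produce garbled characters, which is a hint to try another encoding.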

Edit: Alternatively, why not just leave those poor characters alone and specify the encoding of your LaTeX file at the top:

\usepackage[utf8]{inputenc}

(I've never tried this, but I believe it should work. You may need to replace utf8 with utf8x, though.)

bendin · Answer 2 · 2009-05-26T10:42:40+0000

Please read the following first:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Then come back and ask questions.

Aaron digulla · Answer 3 · 2009-05-26T10:39:30+0000

You need to determine the "encoding" of the input document. Unicode can encode more than a million characters, but for historical reasons files can only contain 8-bit values (0-255). Therefore, the Unicode text must be encoded in some way.

If the document is XML, the encoding should be declared on the first line (encoding="..."; "utf-8" is the default if the "encoding" field does not exist). For HTML, search for "charset".
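As a rough sketch of that lookup, the snippet below searches the first bytes of a file for an XML declaration or an HTML charset. The function name sniff_declared_encoding and both patterns are my own assumptions and are deliberately simple; a mixed document like the one in the question may carry no declaration at all, in which case the function returns None.

```python
import re

# Simplified, hypothetical patterns; real documents can declare encodings
# in more ways than these two cover.
XML_DECL = re.compile(br'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
HTML_CHARSET = re.compile(br'charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def sniff_declared_encoding(head):
    """Return the encoding named in the given leading bytes, or None."""
    match = XML_DECL.search(head) or HTML_CHARSET.search(head)
    if match:
        # Encoding names are plain ASCII, so decoding them is safe.
        return match.group(1).decode("ascii").lower()
    return None


xml_head = b'<?xml version="1.0" encoding="ISO-8859-1"?>'
html_head = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
xml_encoding = sniff_declared_encoding(xml_head)
html_encoding = sniff_declared_encoding(html_head)
```

In practice the bytes would come from something like open(path, 'rb').read(1024), and the result could then be passed on as the encoding when reopening the file.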

If all else fails, open the document in an editor where you can set the encoding (jEdit, for example). Try encodings until the text looks correct. Then use that value as the encoding parameter for codecs.open() in Python.
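Once the text has been decoded with the right encoding, the Greek letters are ordinary characters that a regular expression can match. The sketch below is one possible approach, not the only one: the mapping GREEK_TO_LATEX is a tiny hypothetical sample that would need extending, and wrapping each command in \ensuremath is my own choice so that the result works inside and outside math mode.

```python
import re

# Hypothetical, deliberately small mapping; extend it with the letters
# that actually occur in the document.
GREEK_TO_LATEX = {
    u"\u03b1": u"\\alpha",   # Greek small letter alpha
    u"\u03b2": u"\\beta",    # Greek small letter beta
    u"\u03c0": u"\\pi",      # Greek small letter pi
    u"\u03a9": u"\\Omega",   # Greek capital letter omega
}

# Character class matching any single letter from the mapping.
GREEK_PATTERN = re.compile(u"[" + u"".join(GREEK_TO_LATEX) + u"]")


def greek_to_latex(text):
    """Replace mapped Greek letters in decoded text with LaTeX commands."""
    def replace(match):
        return u"\\ensuremath{" + GREEK_TO_LATEX[match.group(0)] + u"}"
    return GREEK_PATTERN.sub(replace, text)


converted = greek_to_latex(u"circumference = 2\u03c0r")
```

Note that the input must be Unicode text (for example, read through codecs.open with the correct encoding); applied to undecoded bytes, the pattern would not match.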
