Reading raw Unicode strings in Python
I'm new to Python, so my question might be stupid, but even after reading many threads I haven't found an answer.
I have a mixed source document which contains HTML, XML, LaTeX and other text formats, and I am trying to convert it to LaTeX only.
So I used Python to recognize the various commands with regular expressions and replace them with the corresponding LaTeX commands. Everything has worked well so far.
Now I'm left with some "raw" Unicode characters, such as Greek letters. Unfortunately, I'm almost ready to replace them by hand, so I am looking for a smarter way. Is there a way for Python to recognize / read them? And how do I tell Python to recognize / read, e.g., pi written as a Greek letter?
The minimal example of code used is:
    import re

    fh = open('SOURCE_DOCUMENT', 'r')
    stuff = fh.read()
    fh.close()
    new_stuff = re.sub('READ', 'REPLACE', stuff)
    fh = open('LATEX_DOCUMENT', 'w')
    fh.write(new_stuff)
    fh.close()
I'm not sure if this is important information or not, but I am using Python 2.6 running on Windows.
I would be very happy if someone could give me a hint, at least where to find the relevant information or how it might work. Or am I completely wrong and Python cannot do the job ...
Thank you very much in advance.
Cheers,
Britt
You talk about "raw" Unicode strings. What do you mean by that? Unicode itself is not an encoding, but there are different encodings for storing Unicode characters (read this post from Joel).
The open function in Python 3.0 takes an optional encoding argument that lets you specify an encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, take a look at the codecs module, which also provides an open function that lets you specify the file's encoding.
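To make this concrete, here is a minimal sketch of reading the file with an explicit encoding and then replacing Greek letters. It uses io.open, which accepts an encoding argument on both Python 2.6+ and Python 3. The mapping GREEK_TO_LATEX, the helper names, the default of UTF-8, and the choice to wrap each command in $...$ (which assumes the letters occur outside math mode) are all assumptions for illustration, not something from your document.

```python
import io
import re

# Hypothetical mapping; extend it with the characters your document contains.
GREEK_TO_LATEX = {
    u"\u03b1": u"\\alpha",  # Greek small letter alpha
    u"\u03b2": u"\\beta",   # Greek small letter beta
    u"\u03c0": u"\\pi",     # Greek small letter pi
    u"\u03a9": u"\\Omega",  # Greek capital letter omega
}

# Character class that matches any single key of the mapping.
GREEK_RE = re.compile(u"[" + u"".join(GREEK_TO_LATEX) + u"]")


def greek_to_latex(text):
    """Replace each mapped Greek letter with its LaTeX command in $...$."""
    return GREEK_RE.sub(
        lambda m: u"$" + GREEK_TO_LATEX[m.group(0)] + u"$", text)


def convert_file(src, dst, encoding="utf-8"):
    """Decode src with the given encoding, convert, and write dst."""
    with io.open(src, "r", encoding=encoding) as fh:
        stuff = fh.read()  # a Unicode string, not raw bytes
    with io.open(dst, "w", encoding=encoding) as fh:
        fh.write(greek_to_latex(stuff))
```

Once the text has been decoded, a Greek pi is just the single character u"\u03c0", so your existing re.sub approach works on it like on any other character.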
Edit: Alternatively, why not just leave those poor characters alone and specify the encoding of your LaTeX file at the top:
\usepackage[utf8]{inputenc}
(I've never tried this, but I believe it should work. You may need to replace utf8 with utf8x, though.)
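If you go that way, the output file has to really be UTF-8 on disk. A rough sketch of the re-encoding step, assuming you already know the source encoding (the function names and the source_encoding parameter are made up for this example):

```python
import io


def reencode(raw, source_encoding):
    """Decode raw bytes from source_encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def transcode_file(src, dst, source_encoding):
    """Copy src to dst, converting from source_encoding to UTF-8."""
    with io.open(src, "rb") as fh:
        raw = fh.read()
    with io.open(dst, "wb") as fh:
        fh.write(reencode(raw, source_encoding))
```

For example, reencode(b"\xf0", "iso8859_7") turns the single-byte Greek pi of ISO 8859-7 into the two-byte UTF-8 sequence b"\xcf\x80".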
You need to determine the encoding of the input document. Unicode can encode over a million characters, but for historical reasons files can only contain 8-bit values (0-255). Therefore, Unicode text must be encoded in some way.
If the document is XML, the encoding should be declared on the first line (encoding="..."; "utf-8" is the default if the encoding attribute is missing). For HTML, search for "charset".
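Looking for those declarations can be automated. This is only a sketch: the two regular expressions are simplified guesses at what an XML declaration and an HTML charset attribute look like, and declared_encoding is a hypothetical helper name. Pass it the first few hundred bytes of the file, read in binary mode.

```python
import re

# Simplified patterns; real-world documents may need something more robust.
XML_DECL_RE = re.compile(
    br"""<\?xml[^>]*encoding\s*=\s*["']([A-Za-z0-9._-]+)["']""")
HTML_CHARSET_RE = re.compile(
    br"""charset\s*=\s*["']?([A-Za-z0-9._-]+)""", re.IGNORECASE)


def declared_encoding(head):
    """Return the encoding named in the leading bytes, or None if absent."""
    m = XML_DECL_RE.search(head) or HTML_CHARSET_RE.search(head)
    return m.group(1).decode("ascii") if m else None
```

If it returns None for an XML document, fall back to "utf-8" as described above.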
If all else fails, open the document in an editor that lets you set the encoding (jEdit, for example). Try encodings until the text looks correct, then pass that value as the encoding parameter to codecs.open() in Python.
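The same trial and error can be done from Python itself. A small sketch, where preview_decodings and the candidate list are assumptions for illustration:

```python
def preview_decodings(raw, candidates):
    """Map each candidate encoding to the decoded text, or None on failure."""
    previews = {}
    for name in candidates:
        try:
            previews[name] = raw.decode(name)
        except UnicodeDecodeError:
            previews[name] = None  # these bytes are not valid in this encoding
    return previews
```

Print the results for something like ["utf-8", "cp1252", "cp1253", "latin-1"] and see which one shows the Greek letters correctly. Note that a successful decode proves nothing on its own: latin-1 accepts every byte sequence, so you still have to look at the text.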