Python MS Word
Possible duplicate:
Read / write MS Word files in Python
I am looking into a requirements management system (for example, RequisitePro - Rational Rose), and I will need to read an MS Word document to find specific tags, in a Windows or Apple OS environment. Is there any known framework for this (I couldn't find one), or are there suggested approaches?
Just to add some clarification - this is not going to be a one-time read, I would go through the document every time there is an update for it and do CRUD on specific areas of requirements.
First, get it out of the native Word format (.doc).
- Use "Save as XML" and insist that your users work with this file instead of the .doc file. They will hardly notice the difference, except that the file is larger. Use lxml or ElementTree to parse the XML and find headings, sections, paragraphs, and lists.
- You can also use "Save as HTML" before doing your analysis. This works as well as the XML version. However, the HTML version is not as easy for users to work with, so only convert right before your analysis. Use Beautiful Soup to parse the HTML and find headings, sections, paragraphs, and lists.
Once you have the parsed structure (XML or HTML), you can search the document for specific tags.
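As a rough sketch of the XML route, the snippet below uses the standard library's ElementTree to pull paragraph text out of a Word 2003 "Save as XML" file and look for requirement tags. The namespace is the one Word 2003 XML files use; the `REQ-<number>` tag convention, the function names, and the inline sample document are assumptions made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 "Save as XML" (WordprocessingML) files.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"

# Hypothetical tag convention: identifiers such as REQ-12.
TAG_PATTERN = re.compile(r"\bREQ-\d+\b")


def paragraphs(root):
    """Yield the plain text of each w:p paragraph, joining its w:t runs."""
    for p in root.iter(f"{{{W_NS}}}p"):
        yield "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))


def find_tagged(xml_text):
    """Return (tag, paragraph_text) pairs for every tag found."""
    root = ET.fromstring(xml_text)
    hits = []
    for text in paragraphs(root):
        for tag in TAG_PATTERN.findall(text):
            hits.append((tag, text))
    return hits


# Tiny stand-in for a real exported document.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
<w:body>
<w:p><w:r><w:t>Introduction</w:t></w:r></w:p>
<w:p><w:r><w:t>REQ-1 The system </w:t></w:r><w:r><w:t>shall log in users.</w:t></w:r></w:p>
</w:body></w:wordDocument>"""

if __name__ == "__main__":
    for tag, text in find_tagged(SAMPLE):
        print(tag, "->", text)
```

For a real file, `ET.parse(path).getroot()` would replace `ET.fromstring`. Note that Word often splits one sentence across several runs, which is why the text is joined per paragraph before matching.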
You can build on OpenOffice.org's ability to read Word documents. The Python-UNO bridge lets you use the standard OpenOffice.org API from the Python scripting language. With Python-UNO and the relevant OpenOffice.org components installed on your computer, it should be easy to read most Word documents.
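A minimal sketch of that approach might look like the following. It assumes an office process is already running and listening on a socket (for example, started with something like `soffice --headless --accept="socket,host=localhost,port=2002;urp;"`), and that the `uno` module shipped with the office suite is importable. The function names and the port number are illustrative choices, not anything prescribed by the API.

```python
import os


def connect(host="localhost", port=2002):
    """Return a Desktop object from an already running office process."""
    import uno  # shipped with OpenOffice.org / LibreOffice, not on PyPI

    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        f"uno:socket,host={host},port={port};urp;StarOffice.ComponentContext")
    return ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)


def paragraph_texts(doc):
    """Yield the text of each paragraph in a loaded text document."""
    enum = doc.Text.createEnumeration()
    while enum.hasMoreElements():
        element = enum.nextElement()
        # Tables are enumerated as well; keep only real paragraphs.
        if element.supportsService("com.sun.star.text.Paragraph"):
            yield element.getString()


def read_word_file(path):
    """Load a .doc file through the office process and return its paragraphs."""
    import uno

    desktop = connect()
    url = uno.systemPathToFileUrl(os.path.abspath(path))
    doc = desktop.loadComponentFromURL(url, "_blank", 0, ())
    try:
        return list(paragraph_texts(doc))
    finally:
        doc.close(True)
```

Keeping `paragraph_texts` separate from the connection code means the tag search can be exercised against any object that exposes the same enumeration methods.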
Using Visual Studio Tools for Office (VSTO), you can script Word from any .NET language. The how-to guide on searching for text in documents shows the code in C# and Visual Basic, but IronPython can also call the same .NET methods.
If you're willing to use IronPython (no Mac equivalent), this might be a Windows-specific solution for searching within Word documents.
Assuming you are on Windows and have Word installed, you can control Word from within Python using COM; see Python for win32. On Linux, you can do the same with OpenOffice.
Alternatively, there are many command-line text extractors for Word files on win32 or Linux; you can then run the regular Python regex tools on their output.
See this question: extracting text from MS word files in python
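For the COM route, a sketch along these lines should work on a Windows machine with Word and the pywin32 package installed. `Word.Application`, `Documents.Open`, `Paragraphs`, and `Range.Text` come from Word's object model; the helper names are made up for this example, and nothing here has been checked against a particular Word version.

```python
import os


def open_word(visible=False):
    """Start (or attach to) Word through COM."""
    import win32com.client  # from the pywin32 package; Windows only

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = visible
    return word


def paragraph_texts(doc):
    """Yield each paragraph's text from a Word Document COM object."""
    for para in doc.Paragraphs:
        # Word ends paragraphs with \r (and \x07 inside table cells).
        yield para.Range.Text.rstrip("\r\x07")


def read_paragraphs(path):
    """Open a document read-only and return its paragraphs as a list."""
    word = open_word()
    try:
        doc = word.Documents.Open(os.path.abspath(path), ReadOnly=True)
        try:
            return list(paragraph_texts(doc))
        finally:
            doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
```

Because the goal is repeated CRUD rather than a one-time read, the same `doc` object could also be used to change `para.Range.Text` and then save, instead of only reading.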
If you have some money, you can buy the Aspose.Words Java API. With it, you can programmatically access and manipulate any Word document.
I know this is a Python question, but ...
On Windows, you must use VBScript (VBA Macros) and OLE to access Word programmatically.
Examples | How-tos | Word automation using OLE
On Mac OS X, you use VBA for older versions of Office and AppleScript for Office 2008.
With VBA, you have the choice of either modifying the document in-place, or doing an automatic "Save As" to get the data in a more user-friendly format (though be warned that its HTML export is terrible).
I highly recommend staying away from third-party libraries and products, even if you don't like VBScript. The format is too complex, undocumented, and inconsistent for accurate external processing. StarOffice / OpenOffice is proof of this: they have been trying for years and still cannot parse .doc accurately, let alone .docx. Yes, it works in general, but you run into intractable problems as soon as you start trying to manipulate documents programmatically outside of Word.
You should be able to call VBScript from Python using os.system. I think the interpreter is wscript.exe, but don't quote me on that. This might work though:
import os
os.system('start script.vbs')
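If you need the script's output back in Python, a variation using `subprocess` and the console script host `cscript` may be more convenient than `os.system`. This is only a sketch: the helper names and the idea of passing the document path as an argument are assumptions for illustration.

```python
import subprocess


def cscript_command(script_path, *args):
    """Build the command line for running a VBScript file with cscript."""
    # //nologo suppresses the script host banner in the captured output.
    return ["cscript", "//nologo", script_path, *args]


def run_vbscript(script_path, *args):
    """Run the script (Windows only) and return whatever it printed."""
    result = subprocess.run(
        cscript_command(script_path, *args),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

The VBScript side would write its findings with `WScript.Echo`, and the Python side would then parse the returned text.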