Parsing HTML snippets with libxml SAX

Question

I need to parse HTML snippets, by which I mean files that lack the <html>, <head> and <body> elements but are otherwise well-formed XHTML, guaranteed to be UTF-8 encoded. libxml looks perfect for this task, but I have some constraints that I don't know how to satisfy.

htmlSAXParseFile() does its job well enough, but it seems to build the DOM itself, injecting body and html elements along the way. I would like to build the DOM myself because I may need to skip some elements and change others on the fly. Is there a way to tell libxml not to create a DOM at all, and just parse the HTML and call my handlers?
If that is not possible with the libxml HTML parser, I could use xmlSAXUserParseFile() instead, which doesn't seem to create a DOM. However, since the files are structured like <p> ... </p> <p> ... </p>, with no single root element, the parser stops too early with "Extra content at the end of the document". Is there a way to suppress some parsing errors while still being notified of them (since nobody guarantees that there will never be other errors in those files)?
libxml has a variety of parsing functions, some of which take xmlParserOption flags as a parameter. Alas, xmlSAXUserParseFile() is not one of them, and the ones that do all create a DOM for some non-obvious reason. Am I missing an obvious candidate?

Oh, and I admit that my reluctance to use the libxml DOM may look like a quirk. I am severely constrained on RAM, so I need complete control over the DOM in order to drop some nodes in low-memory conditions and re-read them later if necessary.

Thanks in advance.

c html libxml2 sax

Costique May 14 '10 at 9:53

1 answer

Costique · Accepted Answer · 2010-06-08T20:33:12+0000

OK, since no one has answered this question, I will try to do it myself.

I wrote my own handlers for the start and end of elements, and it looks like libxml no longer creates the DOM. At least the returned document pointer is NULL. It still insists on emitting html and body elements, but I can live with that.

One of the main problems is that libxml reports all whitespace nodes no matter what. Therefore I have to scan the text content myself to strip insignificant whitespace. It's ugly, but it works. Need I mention that parsing UTF-8 by hand is a treat one would rarely miss?
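The whitespace filtering can be sketched roughly like this. It is an illustration, not my actual code: is_blank_run and collapse_spaces are hypothetical helpers that a characters callback could call on the buffer it receives (xmlChar is an unsigned char). It only treats ASCII whitespace as blank, which is safe on UTF-8 input because bytes of multi-byte sequences are never below 0x80. Non-ASCII spaces such as U+00A0 are deliberately left alone.

```c
#include <assert.h>
#include <stddef.h>

/* ASCII whitespace as HTML defines it: space, tab, LF, FF, CR. */
static int is_html_space(unsigned char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* Returns 1 if the run holds nothing but ASCII whitespace (or is empty). */
int is_blank_run(const unsigned char *text, int len) {
    assert(text != NULL || len == 0);
    for (int i = 0; i < len; i++) {
        if (!is_html_space(text[i]))
            return 0;
    }
    return 1;
}

/* Collapses every whitespace run to a single space.
 * out must have room for len bytes; returns the number of bytes written. */
int collapse_spaces(const unsigned char *text, int len, unsigned char *out) {
    int written = 0;
    int in_space = 0;
    assert(out != NULL);
    for (int i = 0; i < len; i++) {
        if (is_html_space(text[i])) {
            if (!in_space)
                out[written++] = ' ';
            in_space = 1;
        } else {
            out[written++] = text[i];   /* UTF-8 bytes pass through untouched */
            in_space = 0;
        }
    }
    return written;
}
```

A characters handler could drop the node when is_blank_run() returns 1 and otherwise store the collapsed copy, so blank nodes between block elements never reach the hand-built DOM.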

To be honest, the libxml documentation is terrible. My advice to anyone who has tried reading the docs: read the source code instead. The code is much more readable and better documented.

Thank you for your attention.
