23 Jun 02:53
lxml.html, now with ignored namespaces!
Thomas Weigel <seasong <at> chantofwaves.com>
2009-06-23 00:53:34 GMT
I am using lxml to parse HTML documents, which include a custom namespace (for example, "<p cs:content='fruit'>FRUIT</p>"). In lxml 2.2.0, on Windows, this worked just fine, and elements could be processed based on this data.

In lxml 2.2.2, on Linux, this fails. The above example becomes "<p content='fruit'>FRUIT</p>" as soon as it is parsed by lxml.html (or lxml.etree.HTMLParser()).

I don't know if this is caused by the switch to Linux, or the upgrade to 2.2.2. I don't have control over the installation, so I can't switch to 2.2.2 under Windows, or 2.2.0 under Linux to check.

I did find this reference (the only reference to this I could find) to the HTML ignoring namespaces:

http://codespeak.net/lxml/lxmlhtml.html#running-html-doctests

...however, it wasn't doing that before, and it seems odd that this is only mentioned in the doctests section.

Is there a way to work around this? Are custom namespaces simply not possible in lxml's HTML?

Notes:

1. The XML parser will not work. Some documents will have legal HTML that breaks an XML parser, like "<br>".

2. Here is the sample code:

-----
>>> import lxml.html as parser
>>> document = parser.fromstring("""<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" xmlns:cs="http://something.com/cs" xml:lang="en" lang="en"><head><title>Help!</title></head><body><p>My namespaces are going to disappear!</p><p cs:content='fruit'>FRUIT</p></body></html>""")
>>> print parser.tostring(document)
-----

The output:

-----
<html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" cs="http://something.com/cs" xml:lang="en" lang="en"><head><title>Help!</title></head><body><p>My namespaces are going to disappear!</p><p content="fruit">FRUIT</p></body></html>
-----

Thomas Weigel
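
One possible workaround, sketched here under assumptions rather than taken from the lxml documentation, is to rename the prefixed attributes to a colon-free form before the markup reaches the HTML parser, so that stripping the prefix can no longer collapse "cs:content" into "content". The helper name protect_cs_attributes and the "cs-" replacement prefix are hypothetical. The regular expression is deliberately naive and would also rewrite matching text that appears outside of tags.

```python
import re

# Hypothetical helper, not part of lxml: match an attribute named cs:<name>
# that follows whitespace and is followed by "=".
_PREFIXED_ATTR = re.compile(r"(\s)cs:([A-Za-z_][\w.-]*)(\s*=)")


def protect_cs_attributes(markup):
    """Rewrite cs:name= as cs-name= so the attribute name contains no colon."""
    return _PREFIXED_ATTR.sub(r"\1cs-\2\3", markup)


sample = "<p cs:content='fruit'>FRUIT</p>"
print(protect_cs_attributes(sample))
```

The rewritten string could then be passed to lxml.html.fromstring() as before, with the processing code looking up "cs-content" instead of "cs:content". The xmlns:cs declaration is left untouched because the pattern only matches names that begin with "cs:" directly after whitespace.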