15 May 06:21
Re: html entities and lxml.html.ElementSoup
From: roger patterson <rogerpatterson <at> gmail.com>
Subject: Re: html entities and lxml.html.ElementSoup
Newsgroups: gmane.comp.python.lxml.devel
Date: 2008-05-15 04:21:52 GMT
Hi Viksit,

What you typed was correct, except you have to note that
lxml.html.soupparser.convert_tree(soup) returns a *list* of root elements,
so you can't just do a lxml.etree.tostring() on the list. Depending on your
HTML, choosing the first element will probably work.

I have moved to the trunk now, so am working well with the new
lxml.html.soupparser. But if you're stuck on that branch, then that
work-around worked for me.

Hope it works for you!

cheers
-Roger

2008/5/14 Viksit Gaur <viksit <at> aya.yale.edu>:
> Hi there,
>
>>Roger Patterson wrote:
>>> I'm getting an interesting situation. When using the very cool
>>> ElementSoup add-on to lxml.html with certain source-html files that
>>> already encode entities (eg. £), using the ElementSoup.parse()
>>> messes up the entities.
>
> I'm running into the same problem.
>
>>It looks like it's not the parse(), but rather the serialisation. What happens
>>is that the entity references end up in the /text/ content, which is clearly
>>wrong as it leads to re-escaping of the references on the way out.
>
>>> What I'm currently doing to solve this is first parsing it with
>>> BeautifulSoup(html, convertEntities="html"), then calling
>>> ElementSoup.convert_tree(soup). This work-around works fine, but I
>>> thought I'd bring it to your attention.
>
> Did you mean something of the sort,
>
> soup = BeautifulSoup(doc, convertEntities="html")
> root = lxml.html.soupparser.convert_tree(soup)
>
> Because I get an error of the form:
>
> File "lxml.etree.pyx", line 2491, in lxml.etree.tostring
> (src/lxml/lxml.etree.c:21792)
> TypeError: Type 'list' cannot be serialized.
>
>>ElementSoup should do that for you. I fixed it on the trunk.
>
>>Stefan
>
> Unfortunately, I can't switch to lxml trunk. Would it be possible for you to
> point me to the code change in lxml so I can patch it myself?
>
> Thanks and Cheers,
> Viksit
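
For anyone reading this thread later, here is a rough sketch of the two points discussed above: an entity reference that ends up as literal text gets escaped again on the way out, and a list of root elements has to be narrowed to one element before it can be serialized. It is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree so that it runs without lxml or BeautifulSoup installed, html.unescape() stands in for what convertEntities="html" does in the work-around, and serialize_first_root() is a hypothetical helper name, not part of lxml.

```python
import html
from xml.etree import ElementTree as ET  # stand-in for lxml.etree (assumption)


def serialize_first_root(roots):
    """Serialize the first element of a list of root elements.

    `roots` plays the role of the list that convert_tree(soup) returns in
    the thread; tostring() needs a single element, not the list itself.
    This helper is hypothetical and only for illustration.
    """
    if not roots:
        raise ValueError("no root elements to serialize")
    return ET.tostring(roots[0], encoding="unicode")


# 1. An entity reference stored as literal text is escaped again on output.
p = ET.Element("p")
p.text = "&pound;5"
escaped_twice = ET.tostring(p, encoding="unicode")  # "<p>&amp;pound;5</p>"

# 2. Decoding the entity before it reaches the tree stores the real
#    character, so nothing is left to re-escape.
p.text = html.unescape("&pound;5")
decoded = ET.tostring(p, encoding="unicode")  # "<p>£5</p>"

# 3. Pick one root out of the list before serializing.
roots = [ET.Element("html"), ET.Element("div")]
ET.SubElement(roots[0], "body").text = "hello"
first = serialize_first_root(roots)  # "<html><body>hello</body></html>"
```

With lxml itself the same idea would presumably be lxml.etree.tostring(roots[0]) on the list returned by convert_tree(soup); whether the first element is the right one depends on the HTML, as noted above.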