roger patterson | 15 May 06:21

Re: html entities and lxml.html.ElementSoup

Hi Viksit,

What you typed was correct, but note that
lxml.html.soupparser.convert_tree(soup) returns a *list* of root
elements, so you can't call lxml.etree.tostring() directly on the list.
Depending on your HTML, serializing just the first element (root[0] in
your snippet) will probably work.
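
As a rough sketch of that idea (not the actual lxml code): the snippet
below serializes a list of root elements one at a time instead of
passing the whole list to tostring(). To keep it runnable without extra
installs, the stdlib xml.etree.ElementTree stands in for lxml.etree, and
the hand-built "roots" list is a hypothetical stand-in for whatever
convert_tree(soup) returns. The helper name serialize_roots is made up
for illustration.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


def serialize_roots(roots, tostring=ET.tostring):
    """Serialize each root element separately and join the results.

    tostring() expects a single element, so the list itself cannot be
    passed in; we call it once per root instead.
    """
    return b"".join(tostring(root) for root in roots)


# Hypothetical stand-in for the list returned by convert_tree(soup).
roots = [ET.fromstring("<p>Price: 5</p>")]

first_only = ET.tostring(roots[0])   # the "take the first element" route
everything = serialize_roots(roots)  # covers the case of several roots
```

With lxml itself, the same shape should apply by passing
lxml.etree.tostring as the tostring argument, assuming convert_tree
gives you a plain list of elements as described above.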

I have since moved to the trunk, and the new lxml.html.soupparser is
working well for me.  But if you're stuck on that branch, the
work-around above worked for me.  Hope it works for you!
cheers
-Roger

2008/5/14 Viksit Gaur <viksit <at> aya.yale.edu>:
> Hi there,
>
>>Roger Patterson wrote:
>>> I'm getting an interesting situation.  When using the very cool
>>> ElementSoup add-on to lxml.html with certain source-html files that
>>> already encode entities (eg. &#163;), using the ElementSoup.parse()
>>> messes up the entities.
>
> I'm running into the same problem.
>
>>It looks like it's not the parse(), but rather the serialisation. What happens
>>is that the entity references end up in the /text/ content, which is clearly
>>wrong as it leads to re-escaping of the references on the way out.
>
>>> What I'm currently doing to solve this is first parsing it with
>>> BeautifulSoup(html, convertEntities="html"), then calling
>>> ElementSoup.convert_tree(soup).  This work-around works fine, but I
>>> thought I'd bring it to your attention.
>
> Did you mean something of the sort,
>
> soup = BeautifulSoup(doc, convertEntities="html")
> root = lxml.html.soupparser.convert_tree(soup)
>
> Because I get an error of the form:
>
> File "lxml.etree.pyx", line 2491, in lxml.etree.tostring
> (src/lxml/lxml.etree.c:21792)
> TypeError: Type 'list' cannot be serialized.
>
>
>
>>ElementSoup should do that for you. I fixed it on the trunk.
>
>>Stefan
>
> Unfortunately, I can't switch to lxml trunk. Would it be possible for you to
> point me to the code change in lxml so I can patch it myself?
>
> Thanks and Cheers,
> Viksit
>
