4 May 12:53
Re: parsing a large file with iterparse()
From: Stefan Behnel <stefan_ml <at> behnel.de>
Subject: Re: parsing a large file with iterparse()
Newsgroups: gmane.comp.python.lxml.devel
Date: 2008-05-04 10:53:34 GMT
Hi,
Stefan Behnel wrote:
> From: <mharper3 <at> uiuc.edu>
>
> Also, adding
>
> elem.clear()
>
> into the loop still eventually leads to a memory error, just much later. This
> should be clearing every element, so I'm not quite sure if I understand what
> clear() actually does.
According to the docs:
"""
clear()
Resets an element. This function removes all subelements, clears all
attributes and sets the text and tail properties to None.
"""
So it does not remove the element itself. I don't know what your XML looks
like, but if it's something like
    <root>
      <a>...</a> * a zillion
    </root>
and you handle the end event of the <a> element and clear() it, you still end
up with a tree that has a zillion empty <a/> children.
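To make this concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, since both share the clear() API. The input document and the count of 1000 are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: a root with 1000 <a> children.
xml_data = b"<root>" + b"<a>some text</a>" * 1000 + b"</root>"

root = None
for event, elem in ET.iterparse(io.BytesIO(xml_data), events=("end",)):
    if elem.tag == "a":
        elem.clear()   # drops text, attributes and children of this <a>
    else:
        root = elem    # the only other end event is the one for <root>

# Every emptied <a/> is still attached to the root element.
remaining = len(root)
print(remaining)
```

The count printed at the end is the number of <a> elements in the input, so memory use still grows with the size of the file, only more slowly.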
I see two choices in this case. There is cElementTree, which has the same API
and allows you to clear the root element.
http://effbot.org/zone/element-iterparse.htm#incremental-parsing
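A minimal sketch of that root-clearing pattern, assuming a made-up input document. It imports xml.etree.ElementTree, because the separate cElementTree module was removed in Python 3.9 and the standard module now uses the C implementation automatically.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: a root with 1000 <a> children.
xml_data = b"<root>" + b"<a>some text</a>" * 1000 + b"</root>"

context = iter(ET.iterparse(io.BytesIO(xml_data), events=("start", "end")))
event, root = next(context)   # the first event is the start of the root

handled = 0
for event, elem in context:
    if event == "end" and elem.tag == "a":
        handled += 1          # handle the element here
        root.clear()          # drop all children collected so far

print(handled, len(root))
```

Clearing the root after each handled element means no emptied children pile up under it.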
This does not work in lxml as you cannot delete elements that are still
required by the tree traversal of the parser (i.e. parents and following
siblings).
But you can try this in lxml:
    for action, elem in context:
        if elem.tag == "page":
            # handle page
            elem.clear()
            # remove all previous siblings
            parent = elem.getparent()
            previous_sibling = elem.getprevious()
            while previous_sibling is not None:
                parent.remove(previous_sibling)
                previous_sibling = elem.getprevious()
BTW, if you only look for "page" tags and do the sibling cleanup as above, you
can just pass tag="page" to iterparse().
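Put together, a self-contained version might look like the sketch below. The input document, the page count and the variable names are assumptions for illustration; the sibling cleanup is the same as in the loop above.

```python
import io
from lxml import etree

# Hypothetical input: a root element holding 1000 <page> children.
xml_data = b"<root>" + b"<page><title>t</title></page>" * 1000 + b"</root>"

pages = 0
root = None
context = etree.iterparse(io.BytesIO(xml_data), events=("end",), tag="page")
for action, elem in context:
    pages += 1                # handle page here
    elem.clear()
    # remove all previous siblings, which have been handled already
    parent = elem.getparent()
    previous_sibling = elem.getprevious()
    while previous_sibling is not None:
        parent.remove(previous_sibling)
        previous_sibling = elem.getprevious()
    root = parent

print(pages, len(root))
```

With tag="page", only the end events of <page> elements are delivered, so the explicit tag check inside the loop is no longer needed. After the loop, only the last (already cleared) page is still attached to its parent.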
Stefan