Stefan Behnel | 4 May 12:18

Re: saving memory with iterparse()

Hi,

Stefan Behnel wrote:
> From: <mharper3 <at> uiuc.edu>
> Thanks so much for the quick response. I did consider that the tree was being
> built in memory, but the documentation seems to suggest that is not the case.
> Specifically the language in the tutorial
> (http://codespeak.net/lxml/tutorial.html) in both the sections 'incremental
> parsing' and 'event-driven parsing' seem to suggest using iterparse to access
> without retaining the tree in memory.

It actually says:

"""
two event-driven parser interfaces, one that generates parser events
while building the tree (``iterparse``), and one that does not build the tree
at all, and instead calls feedback methods on a target object in a SAX-like
fashion.
"""

but I have now added a new example that shows how to save memory:

http://codespeak.net/lxml/tutorial.html#event-driven-parsing
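
For the second interface in the quoted passage, the one that builds no tree at all, here is a minimal sketch. The class name TagCounter and the sample document are made up for illustration, and the fallback import of the standard library's ElementTree (which accepts the same kind of target object) is only there so the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, only so the sketch runs anywhere
    import xml.etree.ElementTree as etree

XML = b"<root><item>1</item><item>2</item></root>"  # made-up sample data


class TagCounter:
    """Hypothetical parser target: counts tags, never builds a tree."""

    def __init__(self):
        self.counts = {}

    def start(self, tag, attrib):
        # Called for every opening tag.
        self.counts[tag] = self.counts.get(tag, 0) + 1

    def end(self, tag):
        # Called for every closing tag; nothing to do here.
        pass

    def data(self, data):
        # Called for text content; ignored in this sketch.
        pass

    def close(self):
        # Whatever is returned here becomes the result of the parse.
        return self.counts


parser = etree.XMLParser(target=TagCounter())
counts = etree.fromstring(XML, parser)
```

Since the target only keeps a small dictionary, memory use does not grow with the size of the document.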

> If you don't mind, why does the
> iterator retain the tree in memory? I would suspect otherwise from the
> 'natural' behavior of iterators/generators in general, though that may be an
> invalid assumption. [...]
> My mistake was to assume that the
> 'used' elements would be freed without an explicit call to do so as the
> iterator progressed.

The question is: how should iterparse() know when you no longer need a
subtree? The end event for a parent always comes after the end events of all
its children, and you might still want to access the whole subtree when you
handle the parent.
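
So the decision has to come from your own code. A minimal sketch, assuming each "item" element is no longer needed once its end event has been handled (the tag name, the sample data and the function name sum_items are made up; the fallback import is only there so the sketch runs without lxml):

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, only so the sketch runs anywhere
    import xml.etree.ElementTree as etree

XML = b"<root><item>1</item><item>2</item><item>3</item></root>"  # made-up data


def sum_items(source):
    """Add up the integer text of all <item> elements, clearing each one."""
    total = 0
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "item":
            total += int(elem.text)
            # We are done with this element: drop its text, attributes
            # and children so they can be garbage collected.
            elem.clear()
    return total


result = sum_items(io.BytesIO(XML))
```

Note that the cleared elements themselves stay attached to the root as empty shells. For very large inputs you may also want to remove them from their parent once they have been processed.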

> (i.e. I would parse the entire tree into memory if I
> thought that I had enough memory to do so; otherwise I would _incrementally_
> parse it.)

The docs actually use two terms: "incremental parsing" and "event-driven
parsing". Incremental parsing means feeding data into the parser one chunk at
a time, while event-driven parsing means you also get back one parser event
at a time.
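
A small sketch of the difference, using made-up sample chunks and helper names (the fallback import is only there so the sketch runs without lxml). The first function feeds the parser chunk by chunk and gets the complete tree back at the end; the second gets one (event, element) pair at a time while the tree is being built.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, only so the sketch runs anywhere
    import xml.etree.ElementTree as etree

CHUNKS = [b"<root><a>one</a>", b"<b>two</b></root>"]  # made-up input chunks


def parse_incrementally(chunks):
    """Incremental parsing: data goes in piece by piece, the tree comes out whole."""
    parser = etree.XMLParser()
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()  # the root element of the complete tree


def collect_events(source):
    """Event-driven parsing: events come out one at a time during the parse."""
    return [(event, elem.tag)
            for event, elem in etree.iterparse(source, events=("start", "end"))]


root = parse_incrementally(CHUNKS)
events = collect_events(io.BytesIO(b"".join(CHUNKS)))
```

In both cases the full tree ends up in memory unless your code discards the parts it no longer needs.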

If you have an idea for how to present this better, I'm happy to take patches:

http://codespeak.net/svn/lxml/trunk/doc/tutorial.txt

Stefan
