4 May 07:34
Re: (no subject)
From: Stefan Behnel <stefan_ml <at> behnel.de>
Subject: Re: (no subject)
Newsgroups: gmane.comp.python.lxml.devel
Date: 2008-05-04 05:34:50 GMT
Hi,

mharper3 <at> uiuc.edu wrote:
> I'm getting glibc/MemoryError/cStringIO crashes/exceptions from the following
> (minimal reproduction) code:
>
> <code>
> import lxml.etree
>
> wiki_xml_filename = 'enwiki-latest-pages-articles.xml' # from http://download.wikimedia.org/enwiki/latest/
> context = lxml.etree.iterparse(wiki_xml_filename, events=("end"))
> for action, elem in context:
>     pass
> </code>
>
> The crash usually occurs about halfway through the file (around <page>
> 3,000,000) The same code runs on smaller mediawiki xml files (200 mb)
> without error. I only get this error for this very large xml file (in this
> case about 13gb uncompressed). I had no trouble parsing the same file with
> the python standard library sax parser, but it is much slower and I don't
> like its api.
>
> Some of the exceptions are MemoryErrors. The machine running the code has
> 4gb of ram. The kernel does not appear to significantly hit the swap during
> the run.

iterparse() builds a tree in memory, so parsing a 13gb file on a 4gb RAM
machine will fail - *unless* you clean up the parts of the tree that you no
longer need. Something like

    for action, elem in context:
        if elem.tag == "page":
            # handle page
            elem.clear()
        elif elem.tag in tag_names_of_ancestors_of_page_elements:
            elem.clear()

might work for you.

BTW, you can also parse the gzip compressed file directly, might even be
faster.

Stefan
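
For reference, here is a minimal, self-contained sketch of the clean-up loop suggested above, run against a tiny gzip-compressed sample instead of the real dump. The names count_pages, page_tag and ancestor_tags are made up for illustration, and the sample uses plain tag names; a real MediaWiki dump may use namespace-qualified tags, so the comparison would need adjusting. The stdlib fallback import is only there so the sketch runs without lxml installed.

```python
import gzip
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib module offers the same iterparse()/clear() calls used here.
    import xml.etree.ElementTree as etree


def count_pages(source, page_tag="page", ancestor_tags=("mediawiki",)):
    """Stream over `source` and free each element once it has been handled.

    `page_tag` and `ancestor_tags` are hypothetical parameters; set them to
    whatever tag names the actual file uses.
    """
    pages = 0
    # Note the trailing comma: events must be a tuple, not the string "end".
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == page_tag:
            pages += 1      # handle the page here
            elem.clear()    # drop its text, attributes and children
        elif elem.tag in ancestor_tags:
            elem.clear()
    return pages


# Tiny stand-in for the real dump: three <page> elements under one root.
sample = (
    b"<mediawiki>"
    + b"".join(b"<page><title>T%d</title></page>" % i for i in range(3))
    + b"</mediawiki>"
)
compressed = gzip.compress(sample)

# Parse the compressed data through a decompressing file object.
with gzip.open(io.BytesIO(compressed)) as stream:
    page_count = count_pages(stream)
```

The sketch wraps the data in gzip.open() so it behaves the same with either import; with a real file you would pass the path to gzip.open() instead of a BytesIO object. Cleared pages remain as empty children of the root until the root itself ends, so memory use can still grow slowly on very large files.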