very long files with many XML entity refs
I have a sample XML file which contains <text>‡‡ .... </text> with 8,000,000 (eight million) repetitions of '‡'.
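For anyone who wants to reproduce this, a file of that shape can be generated with a sketch like the one below. It assumes the repeated item is written as the numeric character reference `&#8225;` (the reference for '‡'). The function name `write_sample`, the `token` parameter, and the chunk size are made up for illustration and are not part of the original test.

```python
import io

def write_sample(path, repetitions, token=u"&#8225;"):
    """Write <text>...</text> containing `repetitions` copies of `token`.

    `token` defaults to the character reference for U+2021; that choice
    is an assumption about what the original sample file looked like.
    """
    chunk = 100000  # write in chunks so the whole payload is never held in one string
    full, rest = divmod(repetitions, chunk)
    with io.open(path, "w", encoding="ascii") as out:
        out.write(u"<text>")
        for _ in range(full):
            out.write(token * chunk)
        out.write(token * rest)
        out.write(u"</text>\n")

# Example: write_sample("sample.xml", 8000000) gives a file of roughly 56 MB.
```

Start with a much smaller count to check how the run time grows before trying the full eight million.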
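A test program that loads it and then writes it back out:
import sys
# Swap the two imports below to compare the libraries.
#import cElementTree as ET
from lxml import etree as ET

# Parse the file named on the command line (binary mode; the parser handles the encoding).
with open(sys.argv[1], 'rb') as f:
    et = ET.ElementTree(file=f)
# Serialize the tree to the output file 'ooo'; with lxml this is the call that hangs.
et.write('ooo')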
When it is run with cElementTree, it completes successfully in about 1 minute.
When it is run with lxml, it does not complete even after 12 hours, and the process stays at 100% CPU the whole time.
Further testing showed that it reaches the write() call quite fast and then gets stuck inside it.
Is this a bug, or is lxml just dead slow relative to cElementTree for this operation?
Notes:
1) There is nothing special about '‡'; it is just a simple sample with the same character repeating. The original problem showed up with a long file of various entity refs (some encoding of binary data).
2) With shorter files (thousands of characters), cElementTree and lxml ran at similar speeds.
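To see where the time goes as the input grows, parse and write can be timed separately for increasing sizes. This is only a sketch under some assumptions: it needs Python 3, it builds the document in memory, and it uses the standard-library `xml.etree.ElementTree` as a stand-in because cElementTree is no longer available on current Python. Pass `lxml.etree` as the first argument to measure lxml instead. `time_roundtrip` is a hypothetical helper name, and the `&#8225;` token is an assumption about the file contents.

```python
import io
import time
import xml.etree.ElementTree as StdET  # stand-in; pass lxml.etree to compare

def time_roundtrip(ET, repetitions, token=b"&#8225;"):
    """Return (parse_seconds, write_seconds) for one <text> element."""
    data = b"<text>" + token * repetitions + b"</text>"
    start = time.perf_counter()
    tree = ET.parse(io.BytesIO(data))   # load step
    parsed = time.perf_counter()
    tree.write(io.BytesIO())            # serialize step, output discarded
    written = time.perf_counter()
    return parsed - start, written - parsed

# Grow the input by a factor of ten each round and watch how each step scales.
for n in (1000, 10000, 100000):
    parse_s, write_s = time_roundtrip(StdET, n)
    print("%8d refs: parse %.3fs, write %.3fs" % (n, parse_s, write_s))
```

If the write time grows much faster than the input size (for example about a hundredfold for each tenfold increase), that points at quadratic behaviour in serialization rather than a constant-factor slowdown.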
TIA
Moshe