Alex Klizhentas | 8 Sep 17:21

context parameter thread safety

Hi All,

The context is the first parameter of the XPath/XSLT extension functions, and the tutorial states that it can be used to save function state.
I wonder whether it is thread safe.

Regards,
Alex
_______________________________________________
lxml-dev mailing list
lxml-dev <at> codespeak.net
http://codespeak.net/mailman/listinfo/lxml-dev
Stefan Behnel | 8 Sep 13:48

Re: Resolving entities

Kovid Goyal wrote:
> My application needs to process XML files that do not have DTD
> declarations but that contain entities.

In this case your document is not well-formed, i.e. not XML.

http://www.w3.org/TR/REC-xml/#sec-references

> Can I inform XMLParser of the entities somehow?

No, there isn't currently a way to work around such a broken document.
libxml2 follows the XML spec strictly in that it rejects references to
undeclared entities in the absence of a DTD.

ElementTree lacks DTD support and instead allows you to specify entities
through a parser local "entity" dictionary. lxml could potentially support
a similar interface by intercepting the entity reference resolving at the
SAX layer ("getEntity()" callback function), but that's not implemented.
Please file a wishlist bug.

Stefan
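
For readers who hit the same problem: a generic workaround outside the parser API is to supply the missing declarations yourself by prepending an internal DTD subset before parsing. The sketch below is only an illustration under assumptions. The helper name `add_internal_subset` and the `ENTITY_DECLS` table are hypothetical, and it assumes the input has no XML declaration and no DOCTYPE of its own.

```python
# Fall back to the standard library if lxml is not installed.
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Hypothetical table of the entities the application expects to see.
ENTITY_DECLS = {"nbsp": "&#160;", "copy": "&#169;"}


def add_internal_subset(xml_text, root_tag, entities):
    """Prepend a DOCTYPE with an internal subset declaring the entities.

    Assumes xml_text has no XML declaration and no DOCTYPE of its own.
    """
    decls = "".join(
        '<!ENTITY %s "%s">' % (name, value)
        for name, value in sorted(entities.items())
    )
    return "<!DOCTYPE %s [%s]>%s" % (root_tag, decls, xml_text)


fragment = "<doc>a&nbsp;b &copy; 2008</doc>"
root = ET.fromstring(add_internal_subset(fragment, "doc", ENTITY_DECLS))
```

With the declarations in place, the parser expands the references itself, so no entity dictionary on the parser is needed.
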
James Graham | 7 Sep 22:13

lxml.html adds a default doctype to HTML documents

In [2]: from lxml import html

In [3]: t = html.fromstring("<html><p>Hello World")

In [4]: docinfo = t.getroottree().docinfo

In [5]: docinfo.public_id
Out[5]: '-//W3C//DTD HTML 4.0 Transitional//EN'

Is it possible to prevent this from occurring? I couldn't see anything in the 
API documentation but I might have been missing something obvious. Silently 
gaining incorrect data is annoying :)

-- 
"Eternity's a terrible thought. I mean, where's it all going to end?"
  -- Tom Stoppard, Rosencrantz and Guildenstern are Dead
Alex Klizhentas | 6 Sep 14:18

Preventing XPath injection

Hi All, 
I'm facing the following issue:

XSLT transformations accept XPath expressions as parameters, so if you write something like:

transform(a, param=" '  '  ' ")

the XPath evaluation will fail. Is there a common/standard way to prevent that?

Alex
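
A common approach to this kind of problem is to turn the untrusted value into an XPath string literal before passing it as a parameter. XPath 1.0 has no escape character inside string literals, so a value containing both quote types has to be assembled with `concat()`. The helper name `xpath_string_literal` below is hypothetical, and the sketch only covers string values. Later lxml releases also document an `XSLT.strparam()` helper for this purpose; whether it is available depends on the installed version.

```python
def xpath_string_literal(value):
    """Return an XPath 1.0 expression that evaluates to the string `value`."""
    if "'" not in value:
        return "'%s'" % value
    if '"' not in value:
        return '"%s"' % value
    # Both quote types are present: split on single quotes and glue the
    # pieces back together with concat(), using "'" for each single quote.
    pieces = ["'%s'" % piece for piece in value.split("'")]
    return "concat(%s)" % ", \"'\", ".join(pieces)


# Intended usage (assuming `transform` is an etree.XSLT object):
#     transform(a, param=xpath_string_literal(untrusted_text))
plain = xpath_string_literal("abc")
single = xpath_string_literal("it's")
mixed = xpath_string_literal("a'b\"c")
```
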

Stefan Behnel | 5 Sep 14:28

lxml 2.1.2 and 2.0.9 released

Hi,

lxml 2.1.2 and 2.0.9 are on PyPI. Both are bug-fix releases for the stable and
mature release series. They mainly fix a thread-related memory problem that
was introduced in the last releases of both branches. Updating is recommended.

The complete changelog follows below.

Have fun,
Stefan

2.0.9 (2008-09-05)
Bugs fixed

    * Memory problem when passing documents between threads.
    * Target parser did not honour the recover option and raised an exception
      instead of calling .close() on the target.

2.1.2 (2008-09-05)
Features added

    * lxml.etree now tries to find the absolute path name of files when
      parsing from a file-like object. This helps custom resolvers when
      resolving relative URLs, as libxml2 can prepend them with the path of
      the source document.

Bugs fixed

    * Memory problem when passing documents between threads.
    * Target parser did not honour the recover option and raised an exception
      instead of calling .close() on the target.
Stefan Behnel | 2 Sep 11:31

[Fwd: [Bug 263898] [NEW] Windows Installer crashes due to access violation]

Hi,

has anyone seen this before, or could someone please test whether this is
reproducible on other machines?

Thanks,
Stefan

--------------------------------------------------------------------------
Subject: [Bug 263898] [NEW] Windows Installer crashes due to access violation
--------------------------------------------------------------------------

Public bug reported:

I tried to install lxml-2.1.1.win32-py2.5.exe on my WinXP PC and got the
attached crash from the installer.
It happened after confirming all settings and clicking the "Next" button
on the "Ready to install" page of the installer.

Using Visual Studio Debugger to investigate crash :

Callstack :
 	ntdll.dll!_RtlEnterCriticalSection@4()  + 0xb
 	msvcr71.dll!_lock_file(void * pf=0x00000000)  Line 236	C
>	msvcr71.dll!fprintf(_iobuf * str=0x00000000, const char * format=0x0012d278, ...)  Line 63 + 0x6	C
 	lxml-2.1.1.win32-py2.5.exe!00402ca8()
 	user32.dll!77d48734()
 	user32.dll!77d48bd9()
 	user32.dll!77d541dc()
 	user32.dll!77d541a9()
 	user32.dll!77d53fd9()
 	ntdll.dll!_RtlpFreeToHeapLookaside@8()  + 0x26
 	ntdll.dll!_RtlFreeHeap@12()  + 0x114
 	ntdll.dll!_RtlpFreeAtom@4()  + 0x1b
 	c5ffffff()

Looking at the fprintf function, I can see that the "str" argument is NULL.
The _lock_file function then uses this NULL pointer to access a structure
member, which eventually results in the reported crash.

** Affects: lxml
     Importance: Undecided
         Status: New

-- 
Windows Installer crashes due to access violation
https://bugs.launchpad.net/bugs/263898
Dirk Holtwick | 31 Aug 18:21

Use cssselect.py in Pyxer

Hi,

I wrote (yet another) templating language for Python based on Genshi, 
since Genshi itself does not yet work on Google App Engine (GAE). Since 
Genshi supports XPath I was thinking about using your cssselect.py 
module together with it. First tests showed that this seems to work fine.

Now I would like to ship a slightly modified version of cssselect.py 
with this new templating language called "Pyxer"

	http://code.google.com/p/pyxer/

so the users do not have to install the whole lxml package (which does 
not work with GAE anyways I suppose).

Since Python "lxml" is under the BSD license and Pyxer under MIT license 
I think this should not be such a big problem as long as I add your 
copyright notices to the file. Am I right?

Thanks for your excellent work
Dirk
Stefan Behnel | 29 Aug 11:08

Re: very long files with many XML entity refs

Moshe Cohen wrote:
> You mentioned there that it was also slow for you using xmllint. For me,
> xmllint'ing it works very fast. Did you use any special options with
> xmllint?

Yes, for a fair comparison, I used similar options as for lxml, i.e.

    xmllint --noent --encode us-ascii

Stefan
Moshe Cohen | 29 Aug 01:42

very long files with many XML entity refs

I have a sample XML file which contains <text>&#135;&#135; .... </text> with 8,000,000 (eight million) repetitions of '&#135;'.

A test program for loading it and then writing it is:

import sys
#import cElementTree as ET
from lxml import etree as ET

# Parse the file named on the command line, then write it back out.
with open(sys.argv[1], 'rb') as f:
    et = ET.ElementTree(file=f)
et.write('ooo')

When it is run with cElementTree, it completes successfully in about 1 minute.
When it is run with lxml, it does not complete even after 12 hours, and the process is constantly at 100% CPU.
Further testing showed that it reaches the 'write' statement quite fast and gets stuck there.

Is this a bug, or is lxml just dead slow relative to cElementTree for this action?

Notes:
1) Nothing special about '&#135;', it is just a simple sample with the same character repeating. The original problem showed up with a long file of various entity refs (some encoding of binary data).
2) Testing with shorter files (thousands of characters) showed similar speed for cElementTree and lxml.

TIA
Moshe
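
To reproduce the report, a file of the shape described above can be generated with a few lines of Python. This is a sketch only: the function name `write_entity_sample`, the chunk size, and the temporary output path are assumptions, and the repetition count is kept small here, so raise it to 8,000,000 to match the original test.

```python
import os
import tempfile


def write_entity_sample(path, repetitions, char_ref="&#135;"):
    """Write <text>...</text> containing `repetitions` copies of char_ref."""
    chunk_size = 10000
    chunk = char_ref * chunk_size
    full_chunks, remainder = divmod(repetitions, chunk_size)
    with open(path, "w", encoding="ascii") as out:
        out.write("<text>")
        # Write in chunks so that memory use stays flat for large counts.
        for _ in range(full_chunks):
            out.write(chunk)
        out.write(char_ref * remainder)
        out.write("</text>\n")


# Small count for a quick check; use 8000000 to match the report.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.xml")
write_entity_sample(sample_path, 25000)
```

The resulting file can be passed as the command-line argument to the test program above.
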

Santoro, Peter | 28 Aug 13:58

questions about lxml xsd:include and xsd:import behavior

After reading http://kontrawize.blogs.com/kontrawize/2007/10/the-trouble-wit.html, I have a couple of questions about how lxml and libxml2 handle imports and includes. I can always run some tests to figure this out for myself, but it would be nice if someone who already knows the official/correct answers could share them.

1) The XML Schema standard appears to leave it up to the parser to determine how to resolve importing multiple schemas from the same namespace. Apparently, some parsers only import the first schema from a given namespace, which could lead to missing definition errors. What is the behavior of lxml/libxml2 here?

2) It appears that some XML parsers incorrectly treat multiple relative includes of the same file as being different. This behavior would cause the same file to be included more than once, causing redefinition errors. What is the behavior of lxml/libxml2 here?

Thank you,

Peter

Stefan Behnel | 27 Aug 11:35

Re: .text_content() should leave spaces. Tests included

Max Ivanov wrote:
>>>      for el in doc.iter():
>>>          if el.text and (el.tag not in self.inlinetags):
>>>              el.text = ''.join((' ',el.text))
>>>          if el.tail and (el.tag not in self.inlinetags):
>>>              el.tail += ' '
>>>          if el.tag == 'br':
>>>              if el.tail and not el.tail.startswith('\n'):
>>>                  el.tail = '\n'+el.tail
>>>              else:
>>>                  el.tail = '\n'
>>>              el.drop_tag()
>>
>> You're modifying the tree here, which is unacceptable for a function that
>> returns a (partial) string serialisation. Apart from that, this seems
>> like a workable solution to your problem.
>
> What's wrong with modifying tree?

I was seeing it in the context of the text_content() method, where tree
modification must not happen.

Stefan
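
The same spacing rules can be applied without touching the tree by collecting the text pieces into a list instead of rewriting .text and .tail. The following is a minimal sketch of that idea, not the lxml implementation: the function name `spaced_text` and the `INLINE_TAGS` set are hypothetical, and the standard library parser is used only to build a small test tree (the element API used here is shared with lxml).

```python
import xml.etree.ElementTree as ET

# Hypothetical set of tags that should not introduce extra whitespace.
INLINE_TAGS = frozenset(["a", "b", "i", "em", "strong", "span"])


def spaced_text(el, inline_tags=INLINE_TAGS):
    """Return the text of `el` with spaces around block elements.

    The tree itself is left unchanged.
    """
    parts = []
    _collect(el, inline_tags, parts)
    return "".join(parts)


def _collect(el, inline_tags, parts):
    if el.tag == "br":
        parts.append("\n")  # a line break becomes a newline
    if el.text:
        # Block-level text gets a leading space, inline text does not.
        parts.append(el.text if el.tag in inline_tags else " " + el.text)
    for child in el:
        _collect(child, inline_tags, parts)
        if child.tail:
            # Text following a block-level child gets a trailing space.
            suffix = "" if child.tag in inline_tags else " "
            parts.append(child.tail + suffix)


doc = ET.fromstring("<div><p>one</p><p>two<b>bold</b>x</p>a<br/>b</div>")
before = ET.tostring(doc)
text = spaced_text(doc)
```

Because only the `parts` list is mutated, serialising the tree before and after the call gives the same result.
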
