jastrachan | 15 Jun 17:04 2004

Re: Heads Up XmlSlurper

On 15 Jun 2004, at 11:56, John Wilson wrote:
> On 15 Jun 2004, at 11:04, jastrachan@... wrote:
>> On 15 Jun 2004, at 10:00, John Wilson wrote:
>>> I have just added the class XmlSlurper to  
>>> org.codehaus.groovy.sandbox.util. It's a sandbox implementation so  
>>> it's experimental and subject to random change ;)
>>> Basically it's another way of implementing the functionality  
>>> provided by XmlParser. It differs in the following ways:
>>> 1/ It uses a different representation for the XML tree. Using arrays  
>>> rather than a List of nodes.
>> Any particular reason why? The main reason for List was that they  
>> support richer semantics like add, remove, sublist and the like.  
>> Remember that folks may wish to mutate Node instances, not just view  
>> XML as read-only data.
> It seemed like a good idea at the time (TM) :)


> I want to see if it really makes a worthwhile difference to the  
> performance. As we are mostly iterating through the List/array it may  
> not. XmlSlurper does not support the mutating of documents - mutating  
> a structure which may have partially completed iterators over it is  
> pretty tricky.

From work on dom4j I can say that the biggest boost in performance I
got was not creating a List / array if an element contained a single  
String / element. We could

>>> It uses iterators to return the result of a selection on the  
>>> document rather than dynamically generating Lists. This is an  
>>> attempt to achieve higher performance with a smaller memory  
>>> footprint. This may be an advantage when manipulating very large  
>>> documents and/or very complex documents.
>> Good idea.
> But see my caveat about mutating the document above.

Agreed. Though I don't think List v array makes that much of a  
difference - particularly as you often need to use a List to create an  
arbitrary length array as you're parsing - as you don't know how many  
you're gonna get.

>>> 2/  If a selection returns just one element the element is returned  
>>> rather than an array of one element. This removed the need for the  
>>> annoying [0] at the end of a GPath statement:
>>> data.CheckSum[0].attributes()
>>> can be written as
>>> data.CheckSum.attributes()
>>> (the old version is supported as well)
>> I guess that makes life much simpler, especially if the old way is  
>> supported too. The worry with leaving off the [0] though is if a  
>> document contains 2 CheckSum elements then things break
> Yes - actually I *want* that to break.
> Using data.CheckSum.attributes() says "I assert that there is only  
> ever one of these elements"

Or how about, I only ever want the first one?

> Using data.CheckSum[0].attributes() says "there may be one or more of  
> these elements and I want the first one"
> Using data.CheckSum.each {it.attributes()...} says "there may be zero
> or more of these elements and I want them all"

That's fair enough.

> I can do data.CheckSum[3].attributes() with the iterator but it's  
> quite expensive and I'd rather not do it unless I have to.

It could use array lookup though, right?
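
To make the three readings above concrete, here is a minimal sketch. It assumes Groovy 3.0 or later, where XmlSlurper lives in groovy.xml, so it exercises the released class rather than the 2004 sandbox version discussed in this thread. The `only` helper and the sample document are hypothetical, written just for this illustration; they show one way to spell out "I assert that there is only ever one of these elements" explicitly.

```groovy
import groovy.xml.XmlSlurper

def data = new XmlSlurper().parseText(
    '<data><CheckSum algorithm="md5">abc</CheckSum>' +
    '<CheckSum algorithm="sha1">def</CheckSum></data>')

// Hypothetical helper, not part of any Groovy API: return the only match
// or fail loudly, i.e. "I assert that there is only ever one of these".
def only = { selection ->
    if (selection.size() != 1) {
        throw new IllegalStateException(
            "expected exactly one element, found ${selection.size()}")
    }
    selection[0]
}

// "There may be one or more of these elements and I want the first one."
def first = data.CheckSum[0].attributes()
assert first.algorithm == 'md5'

// "There may be zero or more of these elements and I want them all."
def algorithms = []
data.CheckSum.each { algorithms << it.attributes().algorithm }
assert algorithms == ['md5', 'sha1']

// The sample has two CheckSum elements, so the single-element assertion fails.
def failed = false
try {
    only(data.CheckSum)
} catch (IllegalStateException expected) {
    failed = true
}
assert failed
```

An explicit helper keeps the "exactly one" contract visible at the call site, whichever way the unindexed form ends up behaving.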

>>> 3/ Because Lists are not returned you can't do this any more
>>> println root.child[3] (but root.child.size() works)
>>> to examine the results of a selection you have to use each {}
>> The big problem with returning Iterator instances is that you can't  
>> reset them. So once you've navigated through them, you're done.  
>> Though I guess if we used ListIterator we could be a bit more clever.
>> I'm wondering if we could support subscript indexing for ListIterator?
>> e.g. supporting the method ListIterator.getAt(int idx) we could start  
>> at the beginning and iterate to the idx element.
> It doesn't quite work like that. The result of root.child is not  
> itself an iterator, it's a class which has an iterator() method. This  
> method creates a new iterator every time it is called (not a very  
> expensive operation)
> so x = root.child
> x.each { println it.name() }
> x.each { println it.name() }
> prints the names out twice. The iterator doesn't have to be reset

Ah I thought x would be an iterator. Cool then!
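
For illustration, the difference between handing back an Iterator and handing back something with an iterator() method can be sketched as follows. The class name ChildSelection and its counter are hypothetical and are not taken from the XmlSlurper source; the sketch only shows why nothing has to be reset between passes.

```groovy
// Hypothetical stand-in for the result of a selection such as root.child.
// It is not an Iterator itself; it hands out a new Iterator on each call.
class ChildSelection implements Iterable<String> {
    private final List<String> names
    int iteratorsCreated = 0   // counts iterator() calls, for demonstration only

    ChildSelection(List<String> names) {
        this.names = names
    }

    Iterator<String> iterator() {
        iteratorsCreated++
        names.iterator()
    }
}

def x = new ChildSelection(['a', 'b', 'c'])
def seen = []
x.each { seen << it }   // first pass
x.each { seen << it }   // second pass starts again from the beginning

assert seen == ['a', 'b', 'c', 'a', 'b', 'c']
assert x.iteratorsCreated == 2
```

Each call to each asks the selection for a fresh iterator, so a used-up iterator from an earlier pass never matters.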

>>> I intend to experiment with ways of handling namespaces with this  
>>> class.
>> I'd always assumed we could use things like
>> someQName = new QName(uri, localname)
>> value = node[someQName]
>> I did ponder about creating a namespace filter facade, so you could  
>> effectively filter an XML tree for a given namespace URI. Typically  
>> we only want to look at an XML document using the 1 namespace we're  
>> interested in.
>> e.g.
>> doc = <x:foo  
>> xmlns:x="http://foo.com/whatnot"><x:bar>123</x:bar></x:foo>
>> ...
>> myUri = "http://foo.com/whatnot"
>> myDoc = doc.namespace(myUri)
>> // now we can navigate the myDoc facade ignoring namespaces, as all
>> // navigations will be in my namespace
>> // as each 'local name' is turned into a QName
>> answer = myDoc.foo.bar
>> // or we can use QNames
>> //  the Namespace class is a factory of QNames
>> myNS = new Namespace(myUri)
>> foo = myNS.foo
>> bar = myNS.bar
>> answer = doc[foo][bar]
>> or
>> answer = doc[myNS.foo][myNS.bar]
> I'm thinking of applying filtering in the path
> root.inNamespace("http://foo.com/whatnot").child.inNamespace("http://bar.com/whatnot").data.text()

Good idea.

Typically though folks navigate through multiple paths in the same
namespace; so rather than having to use the namespace method per step,
it'd be nice to define a namespace, walk through it until you need
another namespace. (I'm not sure if that's what you were thinking in
the above, as you were only performing one step per namespace)



i.e. we only need to use the function call to switch to the namespace  
we're interested in, then we can keep navigating as much as we like.

If we only need 1 namespace to navigate through, then it's like my
previous example...

myDoc = doc.asNamespace(someURI)
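
As a rough sketch of the two namespace styles discussed in this thread, the snippet below uses classes from released Groovy (assuming Groovy 3.0 or later, where XmlParser and XmlSlurper live in groovy.xml). It is not the namespace()/asNamespace() facade proposed above, which is only an idea at this point: the first half uses groovy.xml.Namespace as a factory of QNames, and the second declares a prefix once with declareNamespace and then navigates with it.

```groovy
import groovy.xml.Namespace
import groovy.xml.XmlParser
import groovy.xml.XmlSlurper

def xml = '<x:foo xmlns:x="http://foo.com/whatnot"><x:bar>123</x:bar></x:foo>'
def myUri = 'http://foo.com/whatnot'

// QName lookup: Namespace acts as a factory of QNames (myNS.bar is a QName).
def myNS = new Namespace(myUri)
def foo = new XmlParser().parseText(xml)   // parseText returns the root element
def viaQName = foo[myNS.bar].text()
assert viaQName == '123'

// Prefix lookup: declare the namespace once, then keep navigating with it.
def doc = new XmlSlurper().parseText(xml).declareNamespace(x: myUri)
def viaPrefix = doc.'x:bar'.text()
assert viaPrefix == '123'
```

Note that parseText hands back the root element, so navigation starts at the children of foo rather than at foo itself.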

> so <child> is in http://foo.com/whatnot and data in  
> http://bar.com/whatnot
> I'm seeing documents with the envelope in one namespace, some  
> structure in a second and data (things like, addresses, dates) in a  
> third namespace.


> Attributes in namespaces is another problem altogether!


Thankfully, many uses of attributes are still not using namespaces  
(i.e. most people just namespace the elements and leave the attributes  
non-namespaced) but we need a good solution for this too.

>>>  If the performance gains over XmlParser are not significant I would  
>>> propose to fold the cosmetic change 2/ and any work on namespaces  
>>> back into XmlParser. If it is significantly faster it may live on as  
>>> a separate class.
>> I'd be happy to fold the changes right into XmlParser (and Node if  
>> need be), so long as they don't change the existing semantics much.  
>> i.e. if all the test cases still work, I'd be happy for the  
>> implementation to use iterators. I'm less sure about the array change  
>> (1) though.
> I think change 2/ would be worthwhile. Iterators and mutable documents
> may be a problem, though. The XmlSlurper iterator implementation was
> quite difficult to do and is still pretty ugly. I'll plug away at
> refactoring it to see if I can make it a bit simpler. I would not be
> happy moving it into a mainstream class at the moment.


>>> I have been using GPath on real, large and complex XML documents  
>>> using both XmlParser and XmlSlurper and I have to tell you GPath  
>>> absolutely rocks!
>> :)
>> There are still some places that XPath is better I think - though the  
>> power of GPath is that it can work on any object structure - not just  
>> the XPath model of the XML InfoSet.
>> BTW I wonder if for GroovyMarkup we should actually support the  
>> pointy bracket notation. e.g. the examples above
>> doc = <foo>123</foo>
>> is kinda neat. Much neater than
>> doc = new XmlParser().parseText("<foo>123</foo>")
>> Also we could then make GroovyMarkup look like this...
>> doc = builder.<foo>123</foo>
>> John Rose brought this up on the JSR list when discussing some of the  
>> issues with GroovyMarkup and scoping rules.
>> http://docs.codehaus.org/display/GroovyJSR/property+versus+field+scoping?showComments=true#comment-3994
>> http://docs.codehaus.org/display/GroovyJSR/specifying+GroovyMarkup
>> It'd also make handling of GroovyMarkup and namespaces much easier
> I'm not very keen on this.
> Holger Krekel gave a presentation on this sort of extension to Python
> (XPython -
> http://www.europython.org/conferences/epc2004/info/talks/python_language/hpk01)
> at EuroPython (interestingly one of two separate proposals to add
> MarkupBuilder type functionality to Python).


> The general feeling seemed to be "keep those bloody pointy brackets  
> out of our language" - I quite agree!


I tend to agree - though I'm starting to think we need some kind of  
separate syntax to separate out 'markup' from normal method calls to  
avoid scoping ambiguities mentioned in the above links. I'm open to  
suggestions, but some kind of angle brackets are a possible solution -
there are other suggestions in the above links.