jastrachan | 15 Jun 12:04 2004

Re: Heads Up XmlSlurper

On 15 Jun 2004, at 10:00, John Wilson wrote:
> I have just added the class XmlSlurper to  
> org.codehaus.groovy.sandbox.util. It's a sandbox implementation so  
> it's experimental and subject to random change ;)
> Basically it's another way of implementing the functionality provided  
> by XmlParser. It differs in the following ways:
> 1/ It uses a different representation for the XML tree. Using arrays  
> rather than a List of nodes.

Any particular reason why? The main reason for using a List was that
Lists support richer semantics like add, remove, subList and the like.
Remember that folks may wish to mutate Node instances, not just view
XML as read-only data.
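
To make the List-versus-array point concrete, here is a minimal sketch in plain Java. The class and method names (ChildStorageSketch, addToList, addToArray) are made up for illustration and are not part of XmlParser, XmlSlurper or Node; it only shows that a List can be mutated in place while an array has to be reallocated and copied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the same child nodes held as a List and as an array.
class ChildStorageSketch {

    // With a List, adding a child is a single in-place call.
    static List<String> addToList(List<String> children, String child) {
        children.add(child);
        return children;
    }

    // With an array, adding a child means allocating a bigger array and copying.
    static String[] addToArray(String[] children, String child) {
        String[] grown = Arrays.copyOf(children, children.length + 1);
        grown[children.length] = child;
        return grown;
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<>(Arrays.asList("a", "b", "c"));
        addToList(list, "d");
        list.remove("b");
        System.out.println(list);               // [a, c, d]
        System.out.println(list.subList(0, 2)); // [a, c]

        String[] array = {"a", "b", "c"};
        System.out.println(Arrays.toString(addToArray(array, "d"))); // [a, b, c, d]
    }
}
```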

> It uses iterators to return the result of a selection on the document  
> rather than dynamically generating Lists. This is an attempt to  
> achieve higher performance with a smaller memory footprint. This may  
> be an advantage when manipulating very large documents and/or very  
> complex documents.

Good idea.

> 2/  If a selection returns just one element the element is returned  
> rather than an array of one element. This removed the need for the  
> annoying [0] at the end of a GPath statement:
> data.CheckSum[0].attributes()
> can be written as
> data.CheckSum.attributes()
> (the old version is supported as well)

I guess that makes life much simpler, especially if the old way is
supported too. The worry with leaving off the [0], though, is that if a
document contains two CheckSum elements, things break.
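
A rough sketch of that worry, in plain Java. The select method below is a hypothetical stand-in for "unwrap the result when exactly one element matches"; it is not how XmlSlurper is actually implemented. The point is only that the type of the result depends on the document, so code written against the single-element case sees something different once a second CheckSum shows up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a selection that unwraps single-element results.
class SingleResultSketch {

    // Returns the element itself for one match, otherwise the whole list.
    static Object select(List<String> matches) {
        return matches.size() == 1 ? matches.get(0) : matches;
    }

    public static void main(String[] args) {
        Object one = select(Arrays.asList("CheckSum#1"));
        Object two = select(Arrays.asList("CheckSum#1", "CheckSum#2"));

        System.out.println(one instanceof String); // true
        System.out.println(two instanceof List);   // true
    }
}
```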

> 3/ Because Lists are not returned you can't do this any more
> println root.child[3] (but root.child.size() works)
> to examine the results of a selection you have to use each {}

The big problem with returning Iterator instances is that you can't  
reset them. So once you've navigated through them, you're done. Though  
I guess if we used ListIterator we could be a bit more clever.

I'm wondering if we could support subscript indexing for ListIterator?

e.g. by supporting a ListIterator.getAt(int idx) method, we could start
at the beginning and iterate to the idx-th element.
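
The idea can be sketched in plain Java as below. This getAt is a hypothetical static helper, not an existing JDK or Groovy method, and it assumes the caller is happy for the iterator's position to be moved. It rewinds with previous(), which is exactly what a plain Iterator cannot do, and then walks forward to the requested index.

```java
import java.util.Arrays;
import java.util.ListIterator;

// Hypothetical sketch of subscript indexing on top of a ListIterator.
class ListIteratorIndexing {

    static <T> T getAt(ListIterator<T> it, int idx) {
        if (idx < 0) {
            throw new IndexOutOfBoundsException("negative index: " + idx);
        }
        // Rewind to the beginning, wherever the iterator currently is.
        while (it.hasPrevious()) {
            it.previous();
        }
        // Skip the first idx elements.
        for (int i = 0; i < idx; i++) {
            if (!it.hasNext()) {
                throw new IndexOutOfBoundsException("index: " + idx);
            }
            it.next();
        }
        if (!it.hasNext()) {
            throw new IndexOutOfBoundsException("index: " + idx);
        }
        return it.next();
    }

    public static void main(String[] args) {
        ListIterator<String> it = Arrays.asList("a", "b", "c", "d").listIterator();
        System.out.println(getAt(it, 3)); // d
        System.out.println(getAt(it, 0)); // a (works even after reaching the end)
    }
}
```

Each lookup walks the iterator from the start, so it costs time proportional to the index; that is the price of not materialising a List.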

> I intend to experiment with ways of handling namespaces with this  
> class.

I'd always assumed we could use things like

someQName = new QName(uri, localname)
value = node[someQName]

I did ponder creating a namespace filter facade, so you could
effectively filter an XML tree for a given namespace URI. Typically we
only want to look at an XML document using the one namespace we're
interested in.


doc = <x:foo xmlns:x="http://foo.com/whatnot"><x:bar>123</x:bar></x:foo>
myUri = "http://foo.com/whatnot"
myDoc = doc.namespace(myUri)

// now we can navigate the myDoc facade ignoring namespaces, as all
// navigations will be in my namespace,
// as each 'local name' is turned into a QName
answer = myDoc.foo.bar

// or we can use QNames
//  the Namespace class is a factory of QNames
myNS = new Namespace(myUri)
foo = myNS.foo
bar = myNS.bar
answer = doc[foo][bar]

// or inline, without the intermediate variables
answer = doc[myNS.foo][myNS.bar]
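
For what "a factory of QNames" might look like, here is a minimal sketch in plain Java using the JDK's javax.xml.namespace.QName. NamespaceSketch and its get method are hypothetical; get("foo") stands in for the Groovy property access myNS.foo, and the Map stands in for a node's children keyed by qualified name.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;

// Hypothetical sketch: a namespace object that hands out QNames for its URI.
class NamespaceSketch {

    private final String uri;

    NamespaceSketch(String uri) {
        this.uri = uri;
    }

    // Stand-in for Groovy property access such as myNS.foo.
    QName get(String localName) {
        return new QName(uri, localName);
    }

    public static void main(String[] args) {
        NamespaceSketch myNS = new NamespaceSketch("http://foo.com/whatnot");

        // Children keyed by QName; two "bar" elements in different namespaces.
        Map<QName, String> children = new HashMap<>();
        children.put(new QName("http://foo.com/whatnot", "bar"), "123");
        children.put(new QName("http://other.example/ns", "bar"), "456");

        // Only the "bar" in our namespace is found.
        System.out.println(children.get(myNS.get("bar"))); // 123
    }
}
```

QName compares by namespace URI and local part, which is why a freshly built QName can be used as a lookup key.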

>  If the performance gains over XmlParser are not significant I would  
> propose to fold the cosmetic change 2/ and any work on namespaces back  
> into XmlParser. If it is significantly faster it may live on as a  
> separate class.

I'd be happy to fold the changes right into XmlParser (and Node if need
be), so long as they don't change the existing semantics much, i.e. if
all the test cases still work, I'd be happy for the implementation to
use iterators. I'm less sure about the array change (1) though.

> I have been using GPath on real, large and complex XML documents using  
> both XmlParser and XmlSlurper and I have to tell you GPath absolutely  
> rocks!


There are still some places where XPath is better, I think, though the
power of GPath is that it can work on any object structure, not just
the XPath model of the XML InfoSet.

BTW I wonder if for GroovyMarkup we should actually support the
pointy-bracket notation used in the examples above, e.g.

doc = <foo>123</foo>

is kinda neat. Much neater than

doc = new XmlParser().parseText("<foo>123</foo>")

Also we could then make GroovyMarkup look like this...

doc = builder.<foo>123</foo>

John Rose brought this up on the JSR list when discussing some of the  
issues with GroovyMarkup and scoping rules.


It'd also make handling of GroovyMarkup and namespaces much easier.