Laurence Rowe | 6 May 2009 00:04

Re: Ignore namespace when parsing

2009/5/2 Aaron Maxwell <amax <at> redsymbol.net>:
> On Friday 01 May 2009 11:00:29 am John Lovell wrote:
>> Aaron:
>>
>> It sounds to me like you could use an xpath query.
>> rootElement.xpath('//*[local-name() = "Child1"]')
>> http://codespeak.net/lxml/xpathxslt.html
>
> Thanks, that does work fine.
>
> My actual problem is somewhat more complex than the simplistic example I gave,
> however.  The structure of the XML document is more like this (lots of the
> actual document is excised):
> {{{
> <ItemLookupResponse
> xmlns="http://webservices.amazon.com/AWSECommerceService/2008-04-07">
>  <OperationRequest>
>  <Items>
>    <Item>
>      <ASIN>0521545668</ASIN>
>      <OfferSummary>
>         (snip)
>      </OfferSummary>
>      <Offers>
>        <Offer>
>          <OfferListing>
>            <Price>
>              <Amount>7517</Amount>
>            </Price>
> (snip)
> }}}
>
> This is from Amazon's Associate Web Service API, incidentally.  What's needed
> is to extract the prices for the offers.  So I first obtain an offer
> element - the easiest way is to use exactly the xpath expression you
> mentioned:
>
> {{{
> offers = tree.xpath('//*[local-name()="Offer"]')
> }}}
>
> Then for each offer in offers, I want to get the price information, i.e. the
> content of that Amount tag.  This works:
> {{{
> def price(offer):
>    return offer.xpath(
>        '*[local-name()="OfferListing"]/*[local-name()="Price"]/*[local-name()="Amount"]'
>    )[0].text
> }}}
>
> But, in a word, "yikes".  There has got to be a less verbose way!  I can't
> skip any of those intermediate elements (there are multiple leaf elements
> named Amount, for example; only the specific one above is the actual sale
> price.)  So something like
> {{{'*[local-name()="OfferListing"]//*[local-name()="Amount"]'}}} fails by
> mixing in garbage with the correct result.
>
> (This will probably improve once I learn xpath a little better - still in the
> process of mastering it.)
>
> Anyway, thanks for the xpath suggestion, John - it's probably better than the
> ns()/no_ns() functions in my first post.  Would still be useful if there is a
> way to instruct lxml.etree to somehow strip out the namespace prefix more
> automatically, if anyone can suggest that.

You can supply a namespaces argument to the xpath method:
{{{
offers = tree.xpath('//aws:Offer',
namespaces=dict(aws="http://webservices.amazon.com/AWSECommerceService/2008-04-07"))
}}}

See http://codespeak.net/lxml/xpathxslt.html for the details.
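
The same namespaces mapping should also shorten the price() helper, since relative paths accept prefixes too. A minimal sketch, assuming the structure shown in the excerpt above: the SAMPLE document is a trimmed, made-up stand-in for the real response, and "aws" is just a local prefix of my choosing (any name works as long as it maps to the right URI).

```python
from lxml import etree

# Namespace URI copied from the response document quoted above; the "aws"
# prefix is an arbitrary local label, not something the document defines.
NS = {"aws": "http://webservices.amazon.com/AWSECommerceService/2008-04-07"}

# Trimmed, made-up sample modeled on the quoted excerpt (assumption).
SAMPLE = b"""<ItemLookupResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2008-04-07">
  <Items>
    <Item>
      <ASIN>0521545668</ASIN>
      <Offers>
        <Offer>
          <OfferListing>
            <Price>
              <Amount>7517</Amount>
            </Price>
          </OfferListing>
        </Offer>
      </Offers>
    </Item>
  </Items>
</ItemLookupResponse>"""


def price(offer):
    """Return the text of OfferListing/Price/Amount for one Offer element."""
    # Every step is spelled out, so other Amount elements are not picked up.
    amounts = offer.xpath("aws:OfferListing/aws:Price/aws:Amount", namespaces=NS)
    return amounts[0].text


tree = etree.fromstring(SAMPLE)
offers = tree.xpath("//aws:Offer", namespaces=NS)
prices = [price(offer) for offer in offers]
```

Keeping the mapping in one module-level dict means the long URI is written once and reused by every query.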

Laurence
