2 May 2007 16:45
Re: SV: XQuerying for a single page
Sylvain Loiseau <sloiseau <at> U-PARIS10.FR>
2007-05-02 14:45:02 GMT
> We're putting our texts into the eXist database. We want to be able
> to show one page of the original text at the time, so we'll be needing
> an XQuery expression to retrieve the specific portion of a text, namely
> between two specific <pb/>'s. Problems arise when the <pb/> fall in
> the middle of another element, say, a <p>. There might also be a
> <div>-break on the middle of the page, etc.
Dealing with two intersecting hierarchies is a problem that CorpusReader tries to
help with (can I say that?); see:
http://panini.u-paris10.fr/~sloiseau/CR/filtres/ExtractMilestone.html
(the documentation is in French, but there are samples). The idea is to be able to
convert any milestone element (such as "pb") into a "normal" element -- if necessary,
converting the elements that conflict with the newly created elements into milestones.
For instance, with a rule such as:
<filter name="foo" javaClass="tei.cr.filters.ExtractMilestone">
<args>
<startBoundary elxpath="tei:pb" />
</args>
</filter>
You can transform this tree:
<p> <!-- first paragraph -->
<pb/> <!-- boundary between first and second page -->
</p>
<p> <!-- second paragraph -->
<pb/> <!-- boundary between second and third page -->
</p>
Into this one:
<pb> <!-- first page -->
<p> <!-- first paragraph... -->
</p>
</pb>
<pb> <!-- second page -->
<p> <!-- first paragraph, cont. -->
</p>
<p> <!-- second paragraph -->
</p>
</pb>
<pb> <!-- third page -->
<p> <!-- second paragraph, cont. -->
</p>
</pb>
----
The conversion between milestones and hierarchy is done using the SAX API, which
is far better suited to this purpose than the DOM API (used by XPath), since
the precedence order (the order in which tags occur in the document) is more
useful than the dominance order (which element contains which) for identifying
and transforming milestones.
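To make the idea concrete, here is a minimal sketch of a milestone-to-container
conversion over SAX events. It is not the CorpusReader code (which is Java); it
uses Python's standard xml.sax module, and the names MilestoneToContainer and
pages_from_milestones are made up for this illustration. It assumes a single
root element, treats everything before the first <pb/> as the first page, and
ignores namespaces. Elements that a <pb/> cuts through are closed and then
reopened on the next page, so their attributes (including any xml:id) get
repeated.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class MilestoneToContainer(xml.sax.ContentHandler):
    """Illustrative handler: turn each empty milestone into a container element."""

    def __init__(self, milestone="pb"):
        super().__init__()
        self.milestone = milestone
        self.open_elements = []  # (name, attrs) of elements open inside the current page
        self.depth = 0           # 0 means we are outside the root element
        self.out = []            # serialized output fragments

    def _start_tag(self, name, attrs):
        rendered = "".join(f" {key}={quoteattr(value)}" for key, value in attrs.items())
        self.out.append(f"<{name}{rendered}>")

    def startElement(self, name, attrs):
        attrs = dict(attrs.items())
        if self.depth == 0:
            # Root element: emit it, then open the first page container.
            self._start_tag(name, attrs)
            self._start_tag(self.milestone, {})
        elif name == self.milestone:
            # Page boundary: close what is open, switch page, reopen in order.
            for open_name, _ in reversed(self.open_elements):
                self.out.append(f"</{open_name}>")
            self.out.append(f"</{self.milestone}>")
            self._start_tag(self.milestone, attrs)
            for open_name, open_attrs in self.open_elements:
                self._start_tag(open_name, open_attrs)
        else:
            self.open_elements.append((name, attrs))
            self._start_tag(name, attrs)
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1
        if name == self.milestone:
            return  # the milestone was empty, so there is nothing to close here
        if self.depth == 0:
            # End of the root element: close the last page first.
            self.out.append(f"</{self.milestone}>")
        else:
            self.open_elements.pop()
        self.out.append(f"</{name}>")

    def characters(self, content):
        self.out.append(escape(content))


def pages_from_milestones(xml_text, milestone="pb"):
    """Return xml_text with milestone elements rewritten as page containers."""
    handler = MilestoneToContainer(milestone)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "".join(handler.out)


print(pages_from_milestones("<text><p>a<pb/>b</p><p>c<pb/>d</p></text>"))
# <text><pb><p>a</p></pb><pb><p>b</p><p>c</p></pb><pb><p>d</p></pb></text>
```

The handler only needs the stack of currently open elements and the order in
which events arrive, which is why a streaming API fits this task well. Once the
pages are real elements, selecting a single page becomes an ordinary path
expression.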
If you're interested in this solution, let me know; I'm currently rewriting all
this code to add support for various milestone schemes.
Sylvain