Michael Beddow | 1 Sep 2003 15:15

Re: Converting from any HTML to TEI

Al Magary wrote:

> I introduced myself a couple months ago as a newbie to TEI and
> have been trying valiantly to follow the traffic, particularly
> this thread as it seems learnable.  Now, in Conal Tuohy's post
> (below), he describes a process that begins in MS Word--which,
> shall we say, is EZ to understand--and end with a properly TEI
> encoded (?) document.

This is indeed useful as a first step if you *have* to start from a Word
document. But it raises the question of whether, in your particular case as
outlined in your initial postings, you really should be starting from a Word
document, since you are creating an edition ab initio.

You originally wrote about making your initial transcription in a
three-column Word document. Line numbers in l.h. column, diplomatic
transcription in centre, annotations in r.h. column. I took this to mean you
were for the time being simply using the PC as an electronic notebook, with
the actual encoding to follow once you had your transcription and
annotations prepared, which is a common and sensible practice.

Your best next step would be to create your encoded version in a
Windows-compatible dedicated XML editor, cutting and pasting the material
from your Word table as you go. None of the conversion routines so far
described would generate useful XML from your three-column Word working
document without pretty arduous customisation, and a lot of the markup your
editorial aims require is not of the type that can readily be pre-encoded
via named Word styles. To try to encode in Word in your circumstances would
add a big layer of additional complications and difficulties that would soon
outweigh the initial pain of learning a different style of editing software.
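
To make the point concrete, here is a rough Python sketch (my own
illustration, not any of the conversion routines mentioned in this thread)
of about the most a mechanical pass over such a table could do. It assumes
each row has already been extracted as a (line number, transcription,
annotation) triple, that the text is verse so that <l> is the right
element, and that annotations become <note type="editorial">; the function
name and the sample rows are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def row_to_tei_line(number, transcription, annotation=""):
    """Map one row of the assumed three-column table to a TEI <l> element."""
    line = ET.Element("l", {"n": str(number)})
    line.text = transcription
    if annotation:
        # Assumption: every annotation is treated as a plain editorial note.
        note = ET.SubElement(line, "note", {"type": "editorial"})
        note.text = annotation
    return line


# Placeholder rows standing in for the Word table's contents.
rows = [
    (1, "first line as transcribed", "a remark on line 1"),
    (2, "second line as transcribed", ""),
]

lg = ET.Element("lg")  # a line group wrapping the individual lines
for number, text, annotation in rows:
    lg.append(row_to_tei_line(number, text, annotation))

print(ET.tostring(lg, encoding="unicode"))
```

Note what this cannot do: anything inside the transcription itself
(abbreviations, deletions, additions, unclear readings) stays as undigested
text, and that is exactly the markup you would still have to add by hand in
an XML editor.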

To be clear on one point: I'm not denying that conversion from Word to quite
complex TEI markup is possible. I could hardly do so, since I've just
finished performing such a conversion on the 10,000-plus A-E entries of the
Second Edition of the Anglo-Norman Dictionary, using a mix of Perl and XSLT
on the files as resaved in OpenOffice native format. But that was because
there really was no other way, after the editors had spent a decade and more
preparing and checking the entries in Word: the risk of their meticulous
labours being corrupted by any sort of rekeying was just too high, quite
apart from the time involved. But the next phase, F-H, which is about to
begin, will be done in XML from the start (and the editors will use XML
software to amend and update the A-E entries as and when necessary).  In my
view, based on pretty hard-won experience, if the task is to create from
scratch, employing TEI markup, a dictionary, an extensive bibliography, a
critical edition with elaborate apparatus, a terminological database, or
indeed anything else other than fairly "flat" documents with little beyond
presentational markup,  then it would be plain crazy to contemplate doing
this in Word and then converting to XML later.

Of course, many documents in TEI-land are a lot simpler than that, and if
the ability to create in Word and then save into a proto-TEI XML, such as is
offered by Sebastian's OO filters or Chuck's Word macros,  brings people
into TEI encoding who would otherwise stay well away, then that's all to the
good. But, as ever, horses for courses.

Michael Beddow

