Taco Hoekwater | 1 Jun 12:56 2010

Re: BibTeXU

Karl Berry wrote:
>     I gather this is what used to be bibtex8 in former years?
> 
> bibtexu is/was a project by Yannis (and a student or two) to use the ICU
> library with BibTeX.  Peter also put in the massive efforts needed to
> make this work in the TL build system and have bibtexu and xetex use the
> same ICU library

Bibtexu does not actually seem to work all that well, or at least it
has some quirks on my 64-bit Linux box. I experimented a bit because it
sounds promising. Long email follows.

I created a small test.aux file with just this in it:

\citation{*}
\bibstyle{plain}
\bibdata{xampl}

At first it complained that it could not find '88591lat.csf'. This
is probably just a packaging error: as it stands, the bibtexu package
should depend on bibtex8 (or the files have to be moved to the bibtexu
package, I do not know whether bibtex8 needs them).  I installed
bibtex8, and that took care of that.

But then, I got this:

[taco@ntg tmp]$ bibtexu test
The 8-bit codepage and sorting file: 88591lat.csf
The top-level auxiliary file: test.aux
The style file: plain.bst
Database file #1: xampl.bib
Terminated

I killed it after about five minutes; by then it had used
2 minutes of CPU time, with a resident size of 1G and a virtual
size of 2.3G (and growing).

valgrind gives about a gazillion

   'Conditional jump or move depends on uninitialised value(s)'

messages.

It seems \citation{*} is causing this trouble, because a test
without it runs fine (changed to \citation{article-full}).
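
For anyone who wants to repeat this comparison without tying up a
machine, here is a minimal Python sketch. It is not part of bibtexu or
TeX Live, and the helper names write_aux and run_with_timeout are made
up for illustration. It writes the three-line aux file shown above for a
given citation key and runs a command under a time limit, so that a
runaway process like the \citation{*} case is killed automatically.

```python
import subprocess
from pathlib import Path


def write_aux(path, cite_key, style="plain", database="xampl"):
    """Write a minimal .aux file like the one used in the test above."""
    lines = [
        "\\citation{%s}" % cite_key,
        "\\bibstyle{%s}" % style,
        "\\bibdata{%s}" % database,
    ]
    Path(path).write_text("\n".join(lines) + "\n", encoding="ascii")


def run_with_timeout(cmd, seconds):
    """Run cmd; return its exit code, or None if it was killed on timeout."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return None
```

Assuming bibtexu is on the PATH, write_aux("test.aux", "*") followed by
run_with_timeout(["bibtexu", "test"], 60) would return None for the
hanging case, and the same call after write_aux("test.aux",
"article-full") would return the normal exit code.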

Having found a working solution, I now wanted to see about
that 'u' at the end of the program name. Big disappointment
there: according to the documentation in 'source', the 'u'
apparently stands for 'Unified' or so, and at first glance it has
nothing to do with Unicode at all. (I could have stopped there,
because to me there would be little point in a drop-in
replacement for bibtex8.)

Nevertheless, the line:

    The 8-bit codepage and sorting file: 88591lat.csf

gave the impression that that csf file is configurable.
00readme.txt from the source says there should be a command
line option:

        -c  --csfile FILE

but this option does not work, nor is it listed in the -h
output: I just get the help text echoed back at me. (There are
more options listed in 00readme.txt that do not exist, but I am
not in the mood to list them all.)

The 00readme.txt from the source says you can set an
environment variable (BIBTEX_CSFILE), so I tried that:

[taco@ntg tmp]$ env BIBTEX_CSFILE=cp47lat.csf bibtexu xampl-latex
The 8-bit codepage and sorting file: 88591lat.csf
The top-level auxiliary file: xampl-latex.aux
The style file: plain.bst
Database file #1: xampl.bib

Didn't work. Continuing on, it turns out that kpsewhich cannot
find cp47lat.csf either, so I tried an absolute path:

[taco@ntg tmp]$ env BIBTEX_CSFILE=/home/taco/texlive/2010/texmf-dist/bibtex/csf/base/cp437lat.csf bibtexu xampl-latex
The 8-bit codepage and sorting file: 88591lat.csf
The top-level auxiliary file: xampl-latex.aux
The style file: plain.bst
Database file #1: xampl.bib

Doesn't work either. Then I remembered having seen a debug
switch: --debug=search:

[taco@ntg tmp]$ env BIBTEX_CSFILE=cp437lat.csf bibtexu --debug=search xampl-latex
The 8-bit codepage and sorting file: 88591lat.csf
The top-level auxiliary file: xampl-latex.aux
The style file: plain.bst
Database file #1: xampl.bib

Also doesn't seem to do anything. Unfazed, I tried --debug=all:

[taco@ntg tmp]$ env BIBTEX_CSFILE=cp437lat.csf bibtexu --debug=all xampl-latex

Lots of output this time, but _nothing_ related to file searching.
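
Since bibtexu itself reports nothing about where it looks, the lookup
can at least be checked from the outside with kpsewhich, as was done
for cp47lat.csf above. A minimal Python sketch follows. The helper name
find_csf is made up, and the sketch assumes kpsewhich prints the
resolved path and exits non-zero when nothing is found. It cannot tell
whether bibtexu actually honours BIBTEX_CSFILE.

```python
import shutil
import subprocess


def find_csf(name):
    """Return the path kpsewhich resolves for name, or None if not found."""
    if shutil.which("kpsewhich") is None:
        return None  # no kpsewhich on the PATH, so nothing can be resolved
    result = subprocess.run(["kpsewhich", name], capture_output=True, text=True)
    path = result.stdout.strip()
    return path if result.returncode == 0 and path else None


# The three file names that appear in the experiments above.
for candidate in ("88591lat.csf", "cp437lat.csf", "cp47lat.csf"):
    print(candidate, "->", find_csf(candidate) or "not found")
```

Checking a name this way before putting it in an environment variable
separates "the file cannot be found" from "the program ignores the
setting".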

Now I could have given up, but then I realized that perhaps the
u in bibtexu is about *input*, not output or whatever is implied
by '8-bit codepage and sorting file'. So I created a copy of
xampl.bib and changed Aamport to "Aaämport", saved as UTF-8,
and ran:

[taco@ntg tmp]$ bibtexu xampl-latex

Much to my surprise, the output is UTF-8! That is exactly what
I wanted, but then what is all this talk about 8-bit csf files
for? I don't understand that at all.

Never mind, now for the real experiment (this is where old bibtex
fails):

   \citation{article-full}
   \bibdata{xampl-utf}
   \bibstyle{alpha}

The "Aaämport" above makes bibtex and bibtex8 generate invalid
UTF-8 output in this case, because they take the first 3 bytes
of the surname instead of the first 3 characters (an important
difference in UTF-8, where one character can be a sequence of
several bytes). Here is what happens with bibtexu:

[taco@ntg tmp]$ bibtexu xampl-latex
The 8-bit codepage and sorting file: 88591lat.csf
The top-level auxiliary file: xampl-latex.aux
The style file: alpha.bst
Database file #1: xampl-utf.bib
6there is a error: U_ZERO_ERROR[taco@ntg tmp]$

It reports an error, but it *did* generate a bbl file, and the
content of that is correct UTF-8:

   \bibitem[Aaä86]{article-full}
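
The difference between cutting after 3 bytes and cutting after 3
characters can be illustrated with a few lines of Python. This is only
a sketch of the encoding issue, not of how any of the bibtex variants
build their labels, and it assumes the ä is stored as the single code
point U+00E4.

```python
surname = "Aa\u00e4mport"  # "Aaämport" with a precomposed ä (U+00E4)
year = "86"

# Byte-oriented truncation: ä is two bytes (C3 A4) in UTF-8, so the
# third byte is only the first half of that character.
byte_prefix = surname.encode("utf-8")[:3]
print(byte_prefix)  # b'Aa\xc3'

try:
    byte_prefix.decode("utf-8")
    byte_label_valid = True
except UnicodeDecodeError:
    byte_label_valid = False  # the cut left an incomplete sequence

# Character-oriented truncation: three code points, still valid UTF-8.
char_label = surname[:3] + year
print(byte_label_valid, char_label)  # False Aaä86
```

The byte-oriented prefix ends in a lone C3 byte, which is why the label
is invalid UTF-8, while the character-oriented prefix gives the Aaä86
seen in the bbl file.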

Then I tried "The ḠṈÄȚŜ and Gnus Document Preparation System".
Output UTF-8: "The ḡṉäțŝ and gnus document preparation system"

It does work after all!
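
For comparison, here is a small Python sketch of why this result needs
Unicode-aware case mapping. The ascii_lower helper is made up to mimic
a tool that only knows the A-Z range; it is not how any of the bibtex
variants are actually implemented. Python's str.lower() stands in for
the kind of full case mapping that a library such as ICU provides.

```python
def ascii_lower(text):
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


title = "The ḠṈÄȚŜ and Gnus Document Preparation System"

# Keep the first character and lowercase the rest, which is what the
# output quoted above appears to do.
ascii_only = title[0] + ascii_lower(title[1:])
unicode_aware = title[0] + title[1:].lower()

print(ascii_only)
print(unicode_aware)
```

The ASCII-only version leaves the accented capitals as they are, while
the Unicode-aware version lowercases them as well.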

This now makes me believe that all this talk about csf files is
just a bit of leftover noise that does not actually mean anything.

So what about that U_ZERO_ERROR report then? No idea. It happens
once for each \citation in the 'alpha' style (as well as in the
cont-xx.bst styles) but it seems harmless.

In the end, what is left is the \citation{*} bug and a lot of
obsolete documentation, I think. (And it took me three hours
to figure this out.)

Best wishes,
Taco

