Mark Crispin | 24 May 2006 21:27

Re: Status of simplification work

On Wed, 24 May 2006, George Sexton wrote:
> Since the preferred character set is UTF-8, this is pretty difficult to do. 
> To actually limit lines correctly would require an intrinsic function in the 
> language to determine the length of each character, and if the current length 
> + current char length > 75,wrap the line.
>
> I program mostly in Java, and I don't know any intrinsic functions that give 
> you the length in bytes of a specific Unicode character.

Do you really need an intrinsic function, when it is so easy to write 
your own?

The following small C function maps a UCS-4 codepoint, including 
those codepoints which are not in Unicode (that is, those with values 
0x110000 - 0x7fffffff), to the corresponding number of UTF-8 octets.  It 
returns 0 for a value which is not a UCS-4 codepoint.

unsigned long utf8_size (unsigned long c)
{
   /* each threshold is the first codepoint that needs one more octet */
   if (c < 0x80) return 1;
   else if (c < 0x800) return 2;
   else if (c < 0x10000) return 3;
   else if (c < 0x200000) return 4;
   else if (c < 0x4000000) return 5;
   else if (c < 0x80000000) return 6;
   return 0;			/* not in UCS-4 */
}

A simpler version, covering only Unicode codepoints, is:

unsigned long utf8_size (unsigned long c)
{
   if (c < 0x80) return 1;
   else if (c < 0x800) return 2;
   else if (c < 0x10000) return 3;
   else if (c < 0x110000) return 4;
   return 0;			/* beyond the Unicode range */
}
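
To show how such a function answers the wrapping question, here is a 
minimal sketch of a fold-point calculation.  The name fold_point, its 
signature, and the choice to stop at an invalid codepoint are my own 
assumptions for illustration and do not come from any specification; 
utf8_size is repeated so that the fragment stands alone.

```c
#include <stddef.h>

/* Octets needed to encode codepoint c in UTF-8 (Unicode range only);
   returns 0 if c is beyond the Unicode range. */
static unsigned long utf8_size (unsigned long c)
{
   if (c < 0x80) return 1;
   else if (c < 0x800) return 2;
   else if (c < 0x10000) return 3;
   else if (c < 0x110000) return 4;
   return 0;
}

/* Hypothetical helper: return how many leading codepoints of s[0..n-1]
   fit on one line of at most limit octets once encoded in UTF-8.
   Stops early at a codepoint that utf8_size rejects. */
size_t fold_point (const unsigned long *s, size_t n, unsigned long limit)
{
   size_t i;
   unsigned long used = 0;	/* octets consumed so far, never > limit */
   for (i = 0; i < n; i++) {
      unsigned long len = utf8_size (s[i]);
      /* stop before an invalid codepoint or one that would overflow */
      if (!len || len > limit - used) break;
      used += len;
   }
   return i;
}
```

A caller would pass 75 as the limit, emit that many codepoints, and 
continue from the returned index.  A real folder would presumably also 
have to subtract whatever continuation octets the format adds to each 
folded line.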

> In the spirit of simplification, if the wrap recommendation is kept then it 
> should be revised to say 75 characters.

I suspect that the 75 octet length limitation is for buffer purposes in 
transport (as in certain email gateways which one hopes are long extinct) 
and not any sort of "character" limitation.

"Characters" are not particularly useful to limit in any case; that notion 
is hopelessly tied to fixed-width fonts and the notion that a "character" 
corresponds to a fixed-length field of real estate.  You can only go so 
far with that notion, even if you accommodate East Asian characters being 
double-width ("two characters"), and various marks applied to characters 
being zero-width ("zero characters").

A "glyph" limit might be useful, but even that is tied up with the notion 
of fixed-width fonts and is invalidated by scripts such as Arabic.

Of course, to do any of this you have to know how text is actually drawn 
on the screen, which tends not to be something done at the level of 
protocols.  Octet counts, on the other hand, are.

Consequently, I recommend that the wrap recommendation be retained, and 
that sample code such as the above be provided for the benefit of those 
people who have trouble understanding how to determine the octet length of 
a Unicode character in UTF-8.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
