Leif Halvard Silli | 15 Dec 12:45 2011

Re: How to register 'unicode'/'unicodeFFFE' ?

Hi Martin,

Yes, I agree that it would be good to hear from those you contacted, 
especially Microsoft. 

I sent the two registrations that I had prepared - see separate letters.

It sounded as if you would prefer new charset registrations rather 
than new aliases for 'UTF-16'. I have gone back and forth on this, but 
I am now back in favour of separate registrations, as I have been for 
a while: a good reason to have them is to emphasize and clarify how 
Microsoft and/or Internet Explorer have implemented the 16-bit UTF 
encodings, which does not seem to be well understood. Two new, 
independent registrations could clarify this, whereas an alias 
solution would require the legacy information to be documented 
somewhere else as well. 

Btw, w.r.t. Anne's table, 
<http://wiki.whatwg.org/wiki/Web_Encodings#Encodings_3>, my 
conclusion is different from his: I may misunderstand the table, but 
'unicodeFFFE' is not the same as UTF-16BE, and 'unicode' is not the 
same as 'UTF-16'. The fact is that Internet Explorer, AFAICT, does not 
support 'UTF-16' (the 'bi-endian encoding'), because IE sees 'utf-16' 
as an alias for 'unicode' (a 'uni-little-endian encoding with the 
BOM'). Had it not been for that alias, a unification in the form of 
making 'unicode' and 'unicodeFFFE' into aliases of 'UTF-16' would have 
been much more straightforward.
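
To make the difference concrete, here is a minimal Python sketch of the 
behaviour as I understand it. It is an illustration under assumptions, not 
anyone's actual implementation: the function name decode_utf16, the 
LABEL_FALLBACK table and the ie_like flag are hypothetical names of mine. 
The BOM, when present, overrides the label; without a BOM, 'UTF-16' is read 
as big-endian (the RFC 2781 recommendation), unless the IE-like aliasing of 
'utf-16' to 'unicode' is switched on.

```python
import codecs

# Hypothetical label table, for illustration only. The mapping follows the
# Microsoft documentation as I read it: 'unicode' is little-endian and
# 'unicodeFFFE' is big-endian.
LABEL_FALLBACK = {
    "unicode": "utf-16-le",
    "unicodefffe": "utf-16-be",
}


def decode_utf16(data: bytes, label: str, ie_like: bool = False) -> str:
    """Decode 16-bit Unicode bytes, letting a BOM override the label."""
    # A BOM, when present, takes precedence over whatever the label says.
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")

    key = label.lower()
    if key == "utf-16":
        # RFC 2781: without a BOM, 'UTF-16' SHOULD be read as big-endian.
        # The IE-like mode treats 'utf-16' as an alias for 'unicode' instead.
        codec = "utf-16-le" if ie_like else "utf-16-be"
    else:
        codec = LABEL_FALLBACK[key]
    return data.decode(codec)


# A big-endian document without the BOM, labelled 'UTF-16'.
big_endian_no_bom = "Hi".encode("utf-16-be")

print(decode_utf16(big_endian_no_bom, "UTF-16"))                # Hi
print(decode_utf16(big_endian_no_bom, "UTF-16", ie_like=True))  # mojibake

# A little-endian document with the BOM survives even a 'wrong' label.
little_endian_bom = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
print(decode_utf16(little_endian_bom, "unicodeFFFE"))           # Hi
```

The last call shows why the labels only matter for BOM-less data: as long 
as the BOM is there, the label is never consulted.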

regards,
Leif H Silli


"Martin J. Dürst", Thu, 15 Dec 2011 18:06:54 +0900:
> Hello Leif,
> 
> I haven't had enough time to look at the stuff below in detail. It 
> looks like a big can of worms :-(. But we have done other 
> registrations in the "HTML legacy murkiness" space recently, so let's 
> give this a try and see how far we get.
> 
> While it is good to have somebody like you championing the 
> registration effort, in this case it would be very helpful to get 
> some input from Microsoft (I have cc'ed Shawn) and the Unicode 
> Consortium (because of the name 'unicode', I cc'ed Patrick who is the 
> official IETF liaison to the Unicode Consortium). I can help with 
> that. Input from Paul and François (also cc'ed), authors of 
> http://tools.ietf.org/html/rfc2781, may also be valuable.
> 
> And yes, if you have some registration templates, please don't 
> hesitate to send them, it always helps to have something concrete to 
> look at. But personally, I'm rather skeptical about adding aliases 
> with murky variations to a registration that went through quite a few 
> drafts and then became an RFC.
> 
> Regards,    Martin.
> 
> On 2011/12/15 15:53, Leif Halvard Silli wrote:
>> Hi! I am ready to submit - and have prepared - two registrations for
>> the 'unicode' and the 'unicodeFFFE' charsets. The two charsets are
>> variants of 'UTF-16', and they differ from each other only with regard
>> to their endianness. Each charset includes the BOM. The registrations
>> are based on Microsoft's specifications:
>> 
>> http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
>> http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
>> 
>> The purpose of the registrations would be to 'document existing
>> practice in a large community', and they should thus "be explicitly
>> marked as being of limited or specialized use and should only be used
>> in Internet messages with prior bilateral agreement".
>> 
>> http://tools.ietf.org/html/rfc2978#section-2.5
>> 
>> In an ideal world, 'unicode'/'unicodeFFFE' would not be necessary to
>> register: We have 'UTF-16', for which the endianness can be signalled
>> via the BOM. Thus one can switch the endianness freely, without having
>> to relabel. For the 'unicode' and 'unicodeFFFE' charsets, by contrast,
>> if one changes the endianness, then one must also switch to the other
>> (or: another) charset label. This is a major reason not to use these
>> charsets.
>> 
>> So far so good: Because both charsets include the BOM, the BOM takes
>> precedence - in particular if the name of the label is not supported by
>> the implementation. Opera and Firefox are in that category, for
>> example. And even Microsoft seems to be in that league, as e.g. IE has
>> no problems handling a little-endian file which includes the
>> 'unicodeFFFE' label, as long as the document *also* contains the BOM.
>> 
>> However, Microsoft's spec includes one additional detail which is not
>> only impractical but also dangerous: 'utf-16' (formally in lowercase)
>> is seen by the Microsoft spec as an alias for 'unicode' - the
>> little-endian charset variant. This is of course incompatible with the
>> 'UTF-16' charset and so the registrations I have prepared, reject this
>> detail. However, for applications that implement the current Microsoft
>> specification (such as IE), this nevertheless means that if your
>> 'text/html' document is big-endian but without the BOM, and you
>> then send 'UTF-16' via the HTTP Content-Type: charset parameter, then
>> you can be certain that IE treats the document as little-endian, with
>> 'mojibake' as the result.
>> 
>>    (You probably would not like to send 'UTF-16' via HTTP Content-Type,
>> though - except as a 'back-up' solution in addition to the BOM, because
>> IE does not seem to cache encoding information sent this way. And so
>> your document would be misinterpreted if you used the back button. For
>> XML, by contrast, the situation seems better than for 'text/html'
>> - perhaps because XML defaults to either UTF-8 or UTF-16.)
>> 
>> As I pondered over this, I first considered that 'unicode' and
>> 'unicodeFFFE' had to be registered as aliases for 'UTF-16'. However,
>> the fact that each charset supports only one 'subset' of UTF-16 (which
>> is a single charset/encoding), meant that it had to be two charsets, if
>> the Microsoft reality is taken as basis.
>> 
>> That said: We should support reality, and not Microsoft reality. And in
>> that regard: Because both charsets include the BOM, it is simple to
>> treat them as aliases for 'UTF-16' - it is only when you create an
>> invalid UTF-16 encoding (that is: you omit the BOM) that legacy IE
>> risks acting up. (IE always considers the BOM before anything else -
>> even before HTTP Content-Type, it seems.)
>> 
>> So there are actually two possibilities here: EITHER to update the
>> 'UTF-16' registration to also cover 'unicode' and 'unicodeFFFE' - then
>> we would also pretty much automatically discourage their use as there
>> would be a clear recommendation in place to use the preferred name -
>> 'UTF-16' - instead. OR, the other option: To register them as two
>> separate charsets.
>> 
>> Making the two labels into aliases of 'UTF-16' would - formally - give
>> them a more prominent status than registering them independently for
>> 'limited use'. To register them as aliases would be to *not* base them
>> on 'Microsoft reality'. Such a thing could perhaps make Microsoft align
>> itself more with 'UTF-16' as it is registered? Another problem with
>> registering them as independent charsets would be that it would be
>> less clear how non-Microsoft products should handle them. Does anyone
>> know if IE10 is behaving any differently w.r.t. UTF-16? Is there a
>> direction towards the standard?
>> 
>> To update the UTF-16 registration seems simple - only a matter of
>> adding the aliases:
>> <http://www.iana.org/assignments/charset-reg/UTF-16>. So I have started
>> to, again, consider that the best option.
>> 
>> I had planned to send the registrations now, but I would like to gather
>> some responses first. However, if the expert reviewers would like, I
>> could post the registrations that I have prepared ASAP - often it is
>> better to have something concrete to look at. (That being said, I have
>> covered very many of the issues in this message ...)
>> 
>> With regards,
>> Leif H Silli
>> 
