15 Dec 2011 12:45
Re: How to register 'unicode'/'unicodeFFFE' ?
Leif Halvard Silli <xn--mlform-iua <at> xn--mlform-iua.no>
2011-12-15 11:45:25 GMT
Hi Martin,

Yes, I agree that it would be good to hear from those you contacted, especially Microsoft. I sent the two registrations that I had prepared - see separate letters.

It sounded as if you would like to see new charset registrations rather than new aliases for 'UTF-16'. I am a bit back and forth ... but now I am forth again, as I have been for a while: A good reason to have separate registrations is to emphasize and clarify how Microsoft and/or Internet Explorer have implemented the 16-bit UTF encodings - it seems that this is not understood very well. Two new, independent registrations could clarify this, whereas an alias solution would require the legacy information to be put somewhere else as well.

Btw, w.r.t. Anne's table, <http://wiki.whatwg.org/wiki/Web_Encodings#Encodings_3>, my conclusion is different from his: I may misunderstand the table, but 'unicodeFFFE' is not the same as UTF-16BE, and 'unicode' is not the same as 'UTF-16'. The fact is that Internet Explorer, AFAICT, does not support 'UTF-16' (the 'bi-endian encoding'), because IE sees 'utf-16' as an alias for 'unicode' (a 'uni-little-endian encoding with the BOM'). Had it not been for the way they see 'utf-16' as an alias for 'unicode', a unification in the form of making 'unicode' and 'unicodeFFFE' into aliases of 'UTF-16' would have been much more straightforward.

regards,
Leif H Silli

"Martin J. Dürst", Thu, 15 Dec 2011 18:06:54 +0900:
> Hello Leif,
>
> I haven't had enough time to look at the stuff below in detail. It
> looks like a big can of worms. But we have done other
> registrations in the "HTML legacy murkiness" space recently, so let's
> give this a try and see how far we get.
>
> While it is good to have somebody like you championing the
> registration effort, in this case it would be very helpful to get
> some input from Microsoft (I have cc'ed Shawn) and the Unicode
> Consortium (because of the name 'unicode', I cc'ed Patrick who is the
> official IETF liaison to the Unicode Consortium). I can help with
> that. Input from Paul and François (also cc'ed), authors of
> http://tools.ietf.org/html/rfc2781, may also be valuable.
>
> And yes, if you have some registration templates, please don't
> hesitate to send them, it always helps to have something concrete to
> look at. But personally, I'm rather skeptical about adding aliases
> with murky variations to a registration that went through quite a few
> drafts and then became an RFC.
>
> Regards, Martin.
>
> On 2011/12/15 15:53, Leif Halvard Silli wrote:
>> Hi! I am ready to submit - and have prepared - two registrations for
>> the 'unicode' and the 'unicodeFFFE' charset. The two charsets are
>> variants of 'UTF-16', and they only differ from each other with regard
>> to their endianness. Each charset includes the BOM. The registrations
>> are based on Microsoft's specifications:
>>
>> http://msdn.microsoft.com/en-us/library/aa752010(v=VS.85).aspx
>> http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
>>
>> The purpose of the registrations would be to document 'existing
>> practice in a large community', and they should thus "be explicitly
>> marked as being of limited or specialized use and should only be used
>> in Internet messages with prior bilateral agreement".
>>
>> http://tools.ietf.org/html/rfc2978#section-2.5
>>
>> In an ideal world, 'unicode'/'unicodeFFFE' would not be necessary to
>> register: We have 'UTF-16', for which the endianness can be signalled
>> via the BOM. Thus one can switch the endianness freely, without having
>> to relabel.
>> For the 'unicode' and 'unicodeFFFE' charsets, by contrast,
>> if one changes the endianness, then one must also switch to the other
>> (or: another) charset label. This is a major reason not to use these
>> charsets.
>>
>> So far so good: Because both charsets include the BOM, the BOM takes
>> precedence - in particular if the name of the label is not supported by
>> the implementation. Opera and Firefox are in that category, for
>> example. And even Microsoft seems to be in that league, as e.g. IE has
>> no problems handling a little-endian file which includes the
>> 'unicodeFFFE' label, as long as the document *also* contains the BOM.
>>
>> However, Microsoft's spec includes one additional detail which is not
>> only impractical but also dangerous: 'utf-16' (formally in lowercase)
>> is seen by the Microsoft spec as an alias for 'unicode' - the
>> little-endian charset variant. This is of course incompatible with the
>> 'UTF-16' charset, and so the registrations I have prepared reject this
>> detail. However, for applications that implement the current Microsoft
>> specification (such as IE), this nevertheless means that if your
>> 'text/html' document is big-endian, but without the BOM, and if you
>> then send 'UTF-16' via the HTTP Content-Type: charset parameter, then
>> you can be certain that IE treats the document as little-endian, with
>> 'mojibake' as the result.
>>
>> (You probably would not like to send 'UTF-16' via HTTP Content-Type,
>> though - except as a 'back-up' solution in addition to the BOM, because
>> IE does not seem to cache encoding information sent this way. And so
>> your document would be misinterpreted if you used the back button. For
>> XML, by contrast, the situation seems better than for 'text/html'
>> - perhaps because XML defaults to either UTF-8 or UTF-16.)
>>
>> As I pondered over this, I first considered that 'unicode' and
>> 'unicodeFFFE' had to be registered as aliases for 'UTF-16'.
>> However, the fact that each charset supports only one 'subset' of
>> UTF-16 (which is a single charset/encoding) meant that it had to be
>> two charsets, if the Microsoft reality is taken as the basis.
>>
>> That said: We should support reality, and not Microsoft reality. And in
>> that regard: Because both charsets include the BOM, it is simple to
>> treat them as aliases for 'UTF-16' - it is only when you create an
>> invalid UTF-16 encoding (that is: you omit the BOM) that legacy IE
>> risks acting up. (IE always considers the BOM before anything else -
>> even before HTTP Content-Type, it seems.)
>>
>> So there are actually two possibilities here: EITHER to update the
>> 'UTF-16' registration to also cover 'unicode' and 'unicodeFFFE' - then
>> we would also pretty much automatically discourage their use, as there
>> would be a clear recommendation in place to use the preferred name -
>> 'UTF-16' - instead. OR, the other option: To register them as two
>> separate charsets.
>>
>> Making the two labels into aliases of 'UTF-16' would - formally - give
>> them a more prominent status than registering them independently for
>> 'limited use'. To register them as aliases would be to *not* base them
>> on 'Microsoft reality'. Such a thing could perhaps make Microsoft align
>> itself more with 'UTF-16' as she is registered? Another problem with
>> registering them as independent charsets would be that it would be
>> less clear how non-Microsoft products should handle them. Does anyone
>> know if IE10 is behaving any differently w.r.t. UTF-16? Is there a
>> direction towards the standard?
>>
>> To update the UTF-16 registration seems simple - only a matter of
>> adding the aliases:
>> <http://www.iana.org/assignments/charset-reg/UTF-16>. So I have started
>> to, again, consider that the best option.
>>
>> I had planned to send the registrations now, but I would like to gather
>> some responses first.
>> However, if the expert reviewers would like, I
>> could post the registrations that I have prepared ASAP - often it is
>> better to have something concrete to look at. (That being said, I have
>> covered very many of the issues in this message ...)
>>
>> With regards,
>> Leif H Silli
>>
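
To make the BOM-versus-label behaviour discussed in this thread concrete, here is a minimal Python sketch. The two label tables and the `decode_labelled` helper are hypothetical names invented for illustration. `IE_STYLE` only models the thread's description of 'utf-16' as an alias for the little-endian 'unicode' charset, and `RFC2781_STYLE` models the RFC 2781 rule that 'UTF-16' text without a BOM SHOULD be read as big-endian. It is not a description of how any browser is actually implemented.

```python
import codecs

# Label tables are assumptions drawn from this thread, not from any spec text:
# IE_STYLE models 'utf-16' as an alias of 'unicode' (little-endian);
# RFC2781_STYLE models RFC 2781's rule that BOM-less 'UTF-16' SHOULD be
# read as big-endian.
IE_STYLE = {"unicode": "utf-16-le", "unicodefffe": "utf-16-be", "utf-16": "utf-16-le"}
RFC2781_STYLE = {"utf-16": "utf-16-be"}


def decode_labelled(data: bytes, label: str, table: dict) -> str:
    """Decode 16-bit Unicode text, letting a BOM override the label."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: fall back to whatever the label table says.
    return data.decode(table[label.lower()])


big_endian = "Hi".encode("utf-16-be")  # two big-endian code units, no BOM

# With a BOM, even a mismatched label is harmless.
with_bom = decode_labelled(codecs.BOM_UTF16_BE + big_endian, "unicode", IE_STYLE)

# Without a BOM, the two readings of 'UTF-16' disagree.
rfc_reading = decode_labelled(big_endian, "UTF-16", RFC2781_STYLE)
ie_reading = decode_labelled(big_endian, "UTF-16", IE_STYLE)

print(with_bom, rfc_reading, repr(ie_reading))
```

Under these assumptions the first two calls recover the original text, while the last one byte-swaps every code unit and yields unrelated characters, which is the 'mojibake' case described above for a big-endian document sent without a BOM.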