23 Sep 2009 16:25
Re: utf8u tag proposal
> -----Original Message-----
> From: William Spitzak [mailto:spitzak@...]
> Sent: Tuesday, September 22, 2009 12:54 PM
> Subject: Re: [Yaml-core] utf8u tag proposal
>
> Osamu TAKEUCHI wrote:
>
> > Since utf8u data can not be safely stored in a utf8 string variable,
> > the raw data is stored in byte array.
>
> I think you said "utf8" when you meant "Unicode". At least I am going
> to hope so!
>
> You seem to have this idea that it is not "unicode" when it is stored
> in a byte array. But somehow storing it in a 16-bit word array makes it
> "Unicode", even though it is still variable length and can still
> contain invalid sequences!
>
> "utf8u" data CAN be "safely" stored in a "utf8" string, because it *IS*
> a UTF-8 string! They are arrays of bytes! Claiming "invalid sequences"
> somehow makes it "not be UTF-8" is like claiming that misspelled words
> makes it not be UTF-8.

I don't know about a "utf8 string variable". The idea that a variable is
a UTF-8 string is only conceptual in most languages. That is, the
programmer knows that it is a UTF-8 encoded sequence of Unicode code
points, but the programming language has no clue about this. Which is
why, as you point out William, it is easy and safe to continue to store
it in a byte array and to manipulate it as such.

Myself, I would say that it is unsafe, however, if the concept is that
it is a UTF-8 encoded string. The reason being that the programmer knows
it is supposed to be UTF-8 encoded data and so he or she might pass it
to an API that ~will~ validate it and (in most languages) throw an
exception or raise an error if it is invalid. The key point being that
the "conceptual" type is that it is a UTF-8 encoded string. So, unless
the programmers have agreed that the conceptual type might include
invalid characters and need special processing, it is quite dangerous in
my opinion. Just saying "utf8 string" or some such doesn't communicate
this.

Notice, though, that this is entirely conceptual.
The reason it is a problem is that the programmer assigns more meaning
to the type than the programming language does, which we always do.
Indeed, we ~must~ do this in order to provide meaningful information
from raw data. It would be reasonable to expect, in a well written
application, that the data, being stored in a byte array (or language
equivalent), would at some point be validated and translated into a
valid sequence that can then be safely used with any API that is
expecting a valid UTF-8 sequence (even if just in a temporary variable).
This is why I see value in distinguishing, conceptually, between a valid
UTF-8 string and a sequence of bytes that is expected to be a UTF-8
string, but still needs to be validated.

This is, however, completely an application side issue, not one for the
YAML specification or a YAML processor to decide (though a YAML
processor might aid in the process). It is important for the application
to know whether or not the data may contain bytes that don't form valid
UTF-8 sequences (which is why we needed to create a separate data type,
not just an encoding), but it is not the place of a general purpose YAML
processor to decide what to do with those sequences (unless specifically
directed to by the application).

I guess my point is that we need to be careful when saying "UTF-8" in
this thread about whether we mean validated UTF-8 that can be safely
passed to APIs expecting Unicode code points (encoded using UTF-8) or
whether we mean a byte sequence that we expect to be UTF-8, but might
not be a valid UTF-8 sequence.

<original text rearranged to correspond to my response>

> The ONLY reason for elevating "invalid sequences" to this magical
> importance is the ulterior motive of making sure it is impossible to
> use UTF-8, perhaps to validate your previous decision to use "wide
> characters" and an unwillingness to admit that it was a mistake.
>
> I have to say I am absolutely floored and shocked at the failure for
> obviously intelligent people to understand this. For some reason UTF8
> turns geniuses into morons, they suddenly act as though everything
> that works with byte arrays is forgotten, or that there is circuitry
> so that their program will crash the moment you store an invalid byte
> sequence, when in fact invalid byte sequences are trivial to detect
> and can be stored losslessly.
>
> THINK! Imagine it is binary data. Would you go through such elaborate
> difficulty trying to make sure that the data when read or written was
> some legal form? Or would defer this until after the binary data is
> loaded in memory and let the code that interprets it figure this out?
> What is so magical about UTF-8 that this cannot be done?
>
> Or pretend the characters are words in ASCII text, and that invalid
> byte sequences are misspelled words. Imagine how you would write the
> program if some strings contained misspelled words, and try applying
> the same ideas to UTF-8.

That was rather presumptuous. It assumes your concept of what a "utf8
string" is. Most of us wouldn't call it a "utf8 string" until it has
been validated (until we knew that we could safely pass it to an API
that requires a series of UTF-8 encoded Unicode code points).

This goes back to the difference between two things: a byte array that
we expect to hold UTF-8 encoded Unicode code points, but that has not
been validated and may still contain invalid data that we will need to
deal with, and a byte array that we have validated and know to contain
only UTF-8 encoded Unicode code points.

There's no point in getting upset over it. It's a difference of opinion
as to when we call it a UTF-8 string (before or after we've validated
the byte sequence). In any case, condemnations and condescending remarks
like these won't help in building a consensus and have, in my opinion,
done a lot to hurt this process.
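
To make the distinction between those two kinds of byte array concrete,
here is a minimal sketch. It is written in Python purely for
illustration (nobody in this thread proposed Python), and validate_utf8
is a hypothetical helper name, not part of any YAML library. It shows
that unvalidated bytes can be stored losslessly, that validation is a
separate explicit step, and that decoding with replacement characters
loses the original data.

```python
from typing import Optional


def validate_utf8(raw: bytes) -> Optional[str]:
    """Return the decoded text if raw is valid UTF-8, otherwise None.

    Hypothetical helper: an application would call something like this
    before handing the data to an API that requires valid UTF-8.
    """
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


# A byte sequence we *expect* to be UTF-8. The byte 0xFF can never
# appear in valid UTF-8, so this one is invalid.
unvalidated = b"caf\xc3\xa9 \xff"

# Kept as a byte array, the data round-trips without loss.
assert bytes(bytearray(unvalidated)) == unvalidated

# Validation is what turns "expected to be UTF-8" into "known UTF-8".
assert validate_utf8(unvalidated) is None
assert validate_utf8(b"caf\xc3\xa9") == "caf\u00e9"

# Decoding with replacement characters always succeeds but is lossy:
# re-encoding the result does not give back the original bytes.
lossy = unvalidated.decode("utf-8", errors="replace")
assert lossy.encode("utf-8") != unvalidated
```

Whether the application validates early, late, or never is its own
decision; the sketch only shows that the two states are distinguishable
and that the choice has consequences.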
> > This class is only used temporarily after the data is input from
> > user and before it is validated to be a valid utf8 string for the
> > future use.
>
> NO! This class is PERMANENT. I will convert all string ***TO*** this
> "utf8u" form. Translation to glyphs is the "temporary" storage and is
> the one I do NOT want in the file and I will NOT put in my data
> structure!!! Translation to glyphs is lossy (because I will not throw
> an error but instead use replacement characters) so it cannot be used
> to store anything!

In your current application (or typical application), yes. Most
applications would, as is their prerogative, want to convert to the
native string type as soon as possible. A general purpose library
should, of course, always make the original byte sequence available,
translating it to the native string type only if it can safely do so in
all cases (where the native string type is essentially a byte array) or
it has been explicitly asked to by the application. Even if the native
string type is essentially a byte array, a good library would still have
to be careful to ensure that the application knows it is getting data
that has not been validated (as UTF-8 encoded Unicode code points).

Osamu's remarks didn't indicate whether he thought the library or the
application would translate his example class into a string. Given the
members that he exposed, I'm thinking that class was what the library
would provide to the application and then the application would
translate if it wanted to. If so, then this would be a good behavior for
a general purpose library (it lets the application decide whether or not
to translate the data while also exposing an easy way to do so).

> > This indeed declares a different type from !!str.
>
> NO NO NO NO NO NO!!!! I will convert all strings TO "utf8u" so they
> are IDENTICAL!!!

It is a different type on the YAML end.
That holds even if you convert them to and from the same native type in
your application (or a library does so on your behalf) and/or don't use
the !!str tag within the YAML presentation of that type. It is
absolutely fundamental to the way YAML works that it defines its own
data types. To be platform independent, it must, and does, define a
common set of types (and also allows users to define their own types).
What's more, there is not always a one to one relationship between YAML
types and native types.

To demonstrate this last point, I'll show how I would map some basic
types if I were writing my own YAML processor for C# (you'll need to
view this with a fixed width font):

+-----------+---------------+
| C# Type   | YAML Type/Tag |
+-----------+---------------+
| bool      | !!bool        |
+-----------+---------------+
| byte      | !!int         |
| sbyte     |               |
| short     |               |
| ushort    |               |
| int       |               |
| uint      |               |
| long      |               |
| ulong     |               |
+-----------+---------------+
| float     | !!float       |
| double    |               |
| decimal   |               |
+-----------+---------------+
| string    | !!str         |
|           | !!utf8u       |
|           | !!null        |
+-----------+---------------+
| byte[]    | !!binary      |
|           | !!utf8u       |
|           | !!null        |
+-----------+---------------+

It doesn't take long to see that there are multiple problems (and I
didn't even get into more complex types). How to map to and from native
types is a problem that the YAML processor and application have to
cooperate on. It is beyond the scope of the YAML specification or the
definition of YAML tags. Oren has mentioned the idea of a schema
language several times, and that is a way to go about it for more
general purpose applications.

In any case, you might note that I list !!utf8u twice. There's a good
reason for that. Firstly, I do it because a .NET string might contain
invalid surrogate pairs that, if we want to preserve them without
violating the YAML specification, would require storage using the utf8u
type rather than the str type (though most applications would prefer the
str type whenever possible).
Secondly, on reading data in, the application may want the processor to
translate utf8u scalars into strings whenever it can or it might even
ask the library to translate invalid UTF-8 sequences in some way.
Otherwise, the processor needs to use a byte array to store the original
data as an unvalidated UTF-8 byte sequence. That means it also needs a
way to communicate this fact to the application. Using Osamu's example
class to encapsulate the byte array would be one way to go about this.

My basic point, though, is that there isn't a one to one relationship
between native types and YAML types. There is, in fact, a many to many
relationship between native types and YAML types. This isn't simply a
matter of how the native type is encoded either. As stated, YAML needs
to define a set of common types to be platform independent. Furthermore,
it needs to define constraints on those types for them to be useful,
which is why !!utf8u is different from !!str. !!str communicates a
constraint that doesn't exist on !!utf8u. That constraint allows mapping
!!str scalars to native string types in many cases where !!utf8u cannot
be (where !!utf8u must be treated as a byte array that needs to be
decoded).

As an aside, I'm still interested in the idea of encoding tags. Although
!!utf8u might not be a valid candidate for an encoding tag (because the
data it represents doesn't satisfy the constraints of the str type), I
can see other cases where it might be useful, such as Osamu's earlier
line feed problem. In general, however, I think some method for
providing hints about the original data type would be more useful (was
that !!null node an object, a string, a byte array, or what?). Schemas,
whether defined implicitly or explicitly, can solve both of these
issues, however. (I guess I've joined the schema bandwagon =p).

Whew!!! That got longer than I intended... my apologies to all for
another long message ^^. Please chalk it up to my attempt to be exact in
my meanings.
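
As a footnote to the mapping table and the two reasons for listing
!!utf8u twice, here is a rough sketch of the many to many mapping. It is
in Python rather than C#, purely for illustration, and the function
names choose_tag and load_scalar are hypothetical, not part of any
existing YAML processor. It relies on the assumption that a Python str
holding a lone surrogate is a fair stand-in for a .NET string with an
invalid surrogate pair. !!utf8u is the tag proposed in this thread, not
a standard YAML tag.

```python
from typing import Optional, Tuple, Union


def choose_tag(value: Optional[Union[str, bytes]]) -> Tuple[str, bytes]:
    """Dump side (hypothetical rule): pick a tag and payload bytes."""
    if value is None:
        return "!!null", b""
    if isinstance(value, bytes):
        return "!!binary", value
    if isinstance(value, str):
        try:
            # Succeeds only if the string is representable as valid UTF-8.
            return "!!str", value.encode("utf-8")
        except UnicodeEncodeError:
            # A lone surrogate cannot be written as valid UTF-8, so it is
            # preserved under the proposed utf8u type instead of !!str.
            return "!!utf8u", value.encode("utf-8", errors="surrogatepass")
    raise TypeError("unsupported native type: " + type(value).__name__)


def load_scalar(tag: str, raw: bytes) -> Union[str, bytes]:
    """Load side (hypothetical rule): one tag may yield two native types."""
    if tag == "!!str":
        # !!str carries the validity constraint, so strict decoding is fine.
        return raw.decode("utf-8")
    if tag == "!!utf8u":
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Hand back the untouched bytes so the application can decide.
            return raw
    raise ValueError("tag not handled in this sketch: " + tag)


# One native type (str) maps to two tags ...
assert choose_tag("abc") == ("!!str", b"abc")
assert choose_tag("\ud800")[0] == "!!utf8u"

# ... and one tag (!!utf8u) maps to two native types.
assert isinstance(load_scalar("!!utf8u", b"abc"), str)
assert isinstance(load_scalar("!!utf8u", b"\xff"), bytes)
```

A real processor would let the application configure this policy (or
wrap the bytes in something like Osamu's class) rather than hard-code
it; the sketch only shows why the table needs !!utf8u in two rows.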