1 Mar 2009 23:17
plea for NFKC & case-folding, suggestions for definitions
Adam M. Costello <idna-update.amc+0+ <at> nicemice.net.RemoveThisWord>
2009-03-01 22:17:08 GMT
I have not had time to follow the progress of this working group, but I have now read the latest draft-ietf-idnabis-{defs,protocol,rationale,tables}, and I have a few high-level comments.

1) I am very happy that an internationalized generalization of preferred-syntax (host name) labels is being defined, based on the principle of including what's needed, rather than excluding only what's clearly useless/harmful. I wanted to work on this in the first IDN working group, instead of or in addition to the wide-open IDNs of IDNA2003, but the rough consensus then was that it was not worth the delay it would cause.

2) I am not persuaded that IDNAbis can avoid requiring the fundamentals of Nameprep: NFKC and case-folding. More on this below.

3) I think the approach taken in the definitions section, of building up the smaller concepts involved in the ACE architecture, is better than the approach I took in RFC 3490--referring to complex multi-step operations ToASCII and ToUnicode as primitives. The small-concepts approach allows the reader to develop some intuition. I have some suggestions for more concise and rigorous definitions following that approach (see below).

Regarding NFKC and case-folding: I think the rationale draft is trying to have it both ways. It says a prefix change would be required if a label that is valid in both IDNA2003 and IDNAbis is represented by different ASCII forms in the two protocols. To avoid triggering that incompatibility, it defines non-normalized and non-case-folded strings as "invalid". But then it admits (in section "Front-end and User Interface Processing for Lookup" in the rationale doc) that in many cases "some local processing of apparent domain name strings will be required, both to maintain compatibility with IDNA2003 and to prevent user astonishment".

In practice applications will often have no choice but to accept non-normalized and/or non-case-folded strings and apply "local processing", for which there is no standard, only suggestions of either "generic preprocessing" (which is basically parts of IDNA2003 that are not specified in IDNAbis and not even included by reference) or "highly localized preprocessing", which is completely unspecified but won't be "a threat to interoperability as long as (i) only U-labels and A-labels are used in interchange with systems outside the local environment..." But domain names are global identifiers, and they get exchanged willy-nilly by humans typing and cutting and pasting them.

My opinion: We can't have it both ways. Breaking compatibility in isolated well-considered corner cases involving a few code points (like the zero-width joiner) is one thing, but breaking compatibility with giant swaths of the namespace, like all non-normalized names and all names containing uppercase characters, and not changing the prefix, would be antithetical to the purpose of the IETF (interoperability) and would betray the trust of the internet community. Since changing the prefix is disallowed by the IDNAbis charter, some Nameprep-like requirement needs to be specified in IDNAbis, based on NFKC and case-folding.

Regarding the definitions: Below are some definitions of various kinds of labels, and some key observations (useful theorems) implied by the definitions. They follow the same general small-concepts approach as in the defs draft, but I've tried to make them a little more concise and rigorous. Draft editors might want to incorporate or draw inspiration from them. These definitions assume that something Nameprep-like is required in the protocol.

============
Definitions:

[[Editorial notes are in double square brackets.]]

A "string" is a sequence of Unicode code points (not bytes). The relationship between code points and bytes is beyond the scope of IDNA.
Conversion of non-Unicode text to/from Unicode is beyond the scope of IDNA.

An "ASCII code point" is a Unicode code point in the range 0..7F.

An "LDH code point" is an ASCII code point that is a letter (41..5A, 61..7A), digit (30..39), or hyphen-minus (U+002D).

An "ASCII string" is a string that contains only ASCII code points (or is empty). Note that a non-ASCII string can contain some ASCII code points.

The "canonical form" of a string is the output of a particular canonicalization function from strings to strings. The function does not always produce an output; sometimes it fails (for example, because the input string contains a disallowed code point). A string for which the function fails has no canonical form. The canonicalization function is idempotent; that is, re-applying it to its own output yields the same output.

The full details of the canonicalization function are specified elsewhere, but it is worth noting here that its treatment of ASCII code points agrees with the rules for validating and comparing ASCII host name labels [RFCs 921, 952, 1123]: If the input contains only LDH code points (or is empty) and neither begins nor ends with hyphen-minus, the function succeeds. If the input contains any non-LDH ASCII code points, or if it begins or ends with hyphen-minus, the function fails. The function replaces uppercase ASCII letters with the corresponding lowercase ASCII letters, and leaves other ASCII code points unchanged.

[[The canonicalization function is the IDNAbis analog of Nameprep, the function all applications use for encoding/decoding ACE and validating & comparing internationalized labels. It excludes any extra restrictions that are enforced only by registries.]]

A "canonical string" is a string that has a canonical form and is equal to it.

A "tagged string" is a string that has a canonical form with U+002D (hyphen-minus) as its 3rd and 4th code points. The first four code points of the canonical form are the "tag".
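
[[To make the ASCII behavior above concrete, here is a minimal Python sketch of only the ASCII-visible part of the canonicalization function. The names is_ldh and canonicalize_ascii are hypothetical and do not come from any draft. The treatment of non-ASCII code points is specified elsewhere, so the sketch simply refuses to handle them.]]

```python
from typing import Optional


def is_ldh(cp: int) -> bool:
    """True for an LDH code point: letter, digit, or hyphen-minus."""
    return (
        0x41 <= cp <= 0x5A      # uppercase ASCII letters
        or 0x61 <= cp <= 0x7A   # lowercase ASCII letters
        or 0x30 <= cp <= 0x39   # digits
        or cp == 0x2D           # hyphen-minus
    )


def canonicalize_ascii(s: str) -> Optional[str]:
    """Sketch of the ASCII rules only; returns None where the function fails.

    ASSUMPTION: the real canonicalization function also handles non-ASCII
    code points; that part is specified elsewhere and is not modeled here.
    """
    # Any non-LDH ASCII code point makes the function fail.
    if any(ord(c) <= 0x7F and not is_ldh(ord(c)) for c in s):
        return None
    # A leading or trailing hyphen-minus makes the function fail.
    if s.startswith("-") or s.endswith("-"):
        return None
    if any(ord(c) > 0x7F for c in s):
        raise NotImplementedError("non-ASCII handling is specified elsewhere")
    # Lowercase A..Z; leave digits and hyphen-minus unchanged.
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)
```

[[Under these assumptions "WWW" canonicalizes to "www", while "_tcp" and "-foo" have no canonical form. Re-applying the function to its own output changes nothing, as idempotence requires.]]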

Two strings are an "XN pair" iff they satisfy all of the following properties:

1) Both are canonical strings.

2) One is an ASCII string and the other is a non-ASCII string.

3) The ASCII string begins with the tag "xn--".

4) The non-ASCII string is non-tagged.

   [[Requiring the non-ASCII string to be non-tagged is stronger than RFC-3490, which required only that it not begin with "xn--". That was probably an oversight. Since IDNAbis is tightening up all sorts of things, it might as well tighten up this too.]]

5) If the ASCII string minus its tag is fed to a Punycode decoder, the result is the non-ASCII string. Equivalently, if the non-ASCII string is fed to a Punycode encoder that outputs lowercase forms, the result is the ASCII string minus its tag. Punycode is specified elsewhere.

If a string is a member of an XN pair, its "XN partner" is the other member of the pair. (It can be shown that a string can belong to at most one XN pair and therefore has at most one XN partner.)

A "domain label" is a component of a domain name, or something that could be (by virtue of its syntax) a component of a domain name. For example, the domain name "www.example.com" (which can in some contexts be written with a trailing dot: "www.example.com.") is composed of three domain labels: "www", "example", and "com". In some contexts there is said to be a fourth domain label, the empty root label. In some contexts domain labels can contain non-text (like binary data).

A "text label" is a domain label that is non-empty and contains only text. Domain labels that are not text labels are outside the scope of IDNA.

An "ASCII label" is an ASCII string with at least 1 and at most 63 code points.

An "LDH label" is an ASCII label that has a canonical form (that is, it contains only LDH code points and neither begins nor ends with hyphen-minus). Because domain labels intended for human consumption have generally been LDH labels, this is the class of domain labels that IDNA extends.
ASCII labels that have no canonical form (like "_tcp" [RFC-2782]) are outside the scope of IDNA.

An "internationalized label" is a generalization of LDH label: It is a string that has a canonical form that is, or is the XN partner of, an LDH label.

Two internationalized labels are "equivalent" iff they have canonical forms that are identical or are an XN pair.

A "tagged label" is an internationalized label that is a tagged string.

An "ACE label" is an internationalized label whose canonical form is the ASCII member of an XN pair. (ACE stands for ASCII Compatible Encoding.)

A "Joker label" is a tagged label that is not an ACE label.

Within the IDNA specifications the unqualified term "label" means internationalized label, but this abbreviation is avoided wherever it might be confusing.

=================
Key observations:

Within the set of all internationalized labels equivalent to any given internationalized label, at most two are canonical. If there are two, they are an XN pair.

For every non-ASCII internationalized label, there exists at least one equivalent internationalized label that is ASCII. (Proof by construction: If the canonical form of an internationalized label is not ASCII, then it has an XN partner that is ASCII.) ASCII forms are needed in some protocols (like DNS).

For every ACE label, there exists at least one equivalent internationalized label that is non-tagged and therefore non-ACE. (Proof by construction: An ACE label's canonical form has an XN partner that is non-tagged.) The non-ACE form is much more user-friendly, because ACE labels contain Punycode-encoded text, which looks like garbage.

Every internationalized label has exactly one canonical ASCII form. Two internationalized labels are equivalent iff their canonical ASCII forms are identical.

Every internationalized label has exactly one canonical non-ACE form. Two internationalized labels are equivalent iff their canonical non-ACE forms are identical.
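
[[The observation that equivalence reduces to comparing canonical ASCII forms can be sketched in Python as below. This is only an illustration: the real canonicalization function is specified elsewhere, so standin_canonicalize (case-folding followed by NFKC) is an assumed placeholder, and all three function names are hypothetical. Validation of LDH syntax and of the non-tagged requirement is omitted for brevity.]]

```python
import unicodedata

ACE_TAG = "xn--"


def standin_canonicalize(s: str) -> str:
    # ASSUMPTION: placeholder for the real canonicalization function,
    # which is specified elsewhere. It only case-folds and applies NFKC
    # and never fails, unlike the real function.
    return unicodedata.normalize("NFKC", s.casefold())


def canonical_ascii_form(label: str) -> str:
    """Return the ASCII canonical form: the canonical form itself if it is
    ASCII, otherwise its would-be XN partner (tag plus Punycode)."""
    canon = standin_canonicalize(label)
    if canon.isascii():
        return canon
    # Python's built-in "punycode" codec implements the RFC 3492 encoder.
    return ACE_TAG + canon.encode("punycode").decode("ascii")


def equivalent(a: str, b: str) -> bool:
    """Two labels are equivalent iff their canonical ASCII forms match."""
    return canonical_ascii_form(a) == canonical_ascii_form(b)
```

[[With this placeholder, equivalent("B\u00fcCHER", "xn--bcher-kva") holds, because both sides reduce to the same ASCII string, whereas "example" and "examples" are not equivalent.]]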

The canonical form of a tagged label is always ASCII, because all non-ASCII canonical tagged strings fail to qualify as internationalized labels.

Joker labels are tagged labels that fail to satisfy the properties of an XN pair. For example, "aa--foo" has the wrong tag, "xn--3" fails in the Punycode decoder, and "xn--aa--foo-" fails to have a non-ASCII non-tagged partner (the would-be partner produced by the Punycode decoder is "aa--foo", which is both ASCII and tagged).

Since most of the definitions are in terms of canonical forms, it can be instructive to categorize the set of canonical internationalized labels as nested subsets:

+-----------------------------------------+
| canonical internationalized labels      |
|                                         |
|  +-----------------------------------+  |
|  | canonical ASCII labels            |  |
|  | all of which are                  |  |
|  | canonical LDH labels              |  |
|  |                                   |  |
|  |  +-----------------------------+  |  |
|  |  | canonical tagged labels     |  |  |
|  |  |                             |  |  |
|  |  |  +-----------------------+  |  |  |
|  |  |  | canonical ACE labels  |  |  |  |
|  |  |  +-----------------------+  |  |  |
|  |  +-----------------------------+  |  |
|  +-----------------------------------+  |
+-----------------------------------------+

There is a one-to-one correspondence, defined by XN pairs, between the outermost ring (the non-ASCII canonical internationalized labels) and the innermost set (the canonical ACE labels). The other two rings are not involved in any XN pairs. The Joker labels are the third ring (canonical tagged non-ACE labels).
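
[[The three inner rings can be told apart mechanically. The Python sketch below assumes its input is already a canonical LDH label and sorts it into "non-tagged", "ace", or "joker" by testing XN pair properties 2 through 5. The function names are hypothetical, and Python's built-in "punycode" codec stands in for the Punycode decoder and encoder. The check that the decoded partner is itself canonical (property 1) is omitted, because the canonicalization function is specified elsewhere.]]

```python
ACE_TAG = "xn--"


def is_tagged(s: str) -> bool:
    """True if the 3rd and 4th code points are both hyphen-minus."""
    return len(s) >= 4 and s[2:4] == "--"


def classify_canonical_ldh(label: str) -> str:
    """Classify a canonical LDH label as 'non-tagged', 'ace', or 'joker'.

    ASSUMPTION: the caller has already canonicalized the label; whether
    the decoded partner is canonical is not checked here.
    """
    if not is_tagged(label):
        return "non-tagged"
    if not label.startswith(ACE_TAG):
        return "joker"          # tagged, but with the wrong tag
    body = label[len(ACE_TAG):]
    try:
        partner = body.encode("ascii").decode("punycode")
    except UnicodeError:
        return "joker"          # the Punycode decoder rejected the input
    if partner.isascii() or is_tagged(partner):
        return "joker"          # partner must be non-ASCII and non-tagged
    if partner.encode("punycode").decode("ascii") != body:
        return "joker"          # re-encoding must reproduce the ASCII form
    return "ace"
```

[[Applied to the examples above, "aa--foo", "xn--3", and "xn--aa--foo-" all come out as "joker", each for the reason given in the text, while "xn--bcher-kva" comes out as "ace" and "www" as "non-tagged".]]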

It can also be instructive to see how internationalized labels relate to the broader universe of domain labels, via another series of nested subsets along a different axis:

+------------------------------+
| labels                       |
|                              |
|  +------------------------+  |
|  | text labels            |  |
|  |                        |  |
|  |  +------------------+  |  |
|  |  | ASCII labels     |  |  |
|  |  |                  |  |  |
|  |  |  +------------+  |  |  |
|  |  |  | LDH labels |  |  |  |
|  |  |  +------------+  |  |  |
|  |  +------------------+  |  |
|  +------------------------+  |
+------------------------------+

IDNA supplies the second ring (non-ASCII text labels); before IDNA, all text labels were ASCII. The scope of IDNA is the set of internationalized labels, which includes the LDH labels and the non-ASCII text labels, but not the intervening ring of non-LDH ASCII labels.

========

AMC