6 Aug 2012 04:31
confusing notation in the ZERO WIDTH NON-JOINER contextual rule
debug <at> test1.org <debug <at> test1.org>
2012-08-06 02:31:27 GMT
Hi,
RFC 5892 contains the following rule about the contextual validity of U+200C:
> If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
> (Joining_Type:T)*(Joining_Type:{R,D})) Then True;
By intuition, I understand that "\u200C" within the regex means the code
point in question. So, a feasible interpretation would be:
(*) The code point MUST occur between Joining_Type:{L,D} and
Joining_Type:{R,D}, where any number of Joining_Type:T code points MAY be
in between.
On the other hand, the statement literally defines just a regex that
should match somewhere in the string (with no reference to "cp" as in other
rules), so the rule would already be satisfied if any single U+200C
fulfills the requirement.
The literal interpretation sounds stupid, but I found both variants
within IDNA2008 implementations.
For instance, consider the Perl module Net::IDN::UTS46 on CPAN. Here,
it's taken literally and hence the sequence
U+0628 U+200C U+0627 U+200C U+0627
is considered to be valid, although the second U+200C is preceded by
U+0627, which is Joining_Type:R, and thus doesn't meet the requirement (*).
On the other hand, the (probably more reliable) implementation idnkit-2
from the Japan Registry reports a CONTEXTJ rule violation for the same
string. Now, who is right?
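To make the two readings concrete, here is a minimal Python sketch of both. It is only an illustration under stated assumptions: the JOINING_TYPE table is hand-written for the few code points used here (a real implementation would take Joining_Type from the Unicode Character Database), the function names are hypothetical, and only the regex part of the rule quoted above is modelled. Each code point is mapped to one letter for its Joining_Type, with "Z" standing for U+200C itself.

```python
import re

# Hand-written Joining_Type table (an assumption for illustration only);
# real code would load these values from the UCD.
JOINING_TYPE = {
    0x0628: "D",  # ARABIC LETTER BEH
    0x0627: "R",  # ARABIC LETTER ALEF
    0x064E: "T",  # ARABIC FATHA
}
ZWNJ = 0x200C


def to_classes(cps):
    """Map code points to one letter each: 'Z' for U+200C, else Joining_Type."""
    return "".join("Z" if cp == ZWNJ else JOINING_TYPE.get(cp, "U") for cp in cps)


def valid_literal(cps):
    """Literal reading: the regex matches somewhere in the string."""
    return re.search(r"[LD]T*ZT*[RD]", to_classes(cps)) is not None


def valid_per_code_point(cps):
    """Reading (*): every U+200C must sit in the required context."""
    s = to_classes(cps)
    for i, c in enumerate(s):
        if c != "Z":
            continue
        before_ok = re.search(r"[LD]T*$", s[:i]) is not None
        after_ok = re.match(r"T*[RD]", s[i + 1:]) is not None
        if not (before_ok and after_ok):
            return False
    return True


sample = [0x0628, 0x200C, 0x0627, 0x200C, 0x0627]
print(valid_literal(sample))         # True
print(valid_per_code_point(sample))  # False
```

For the sequence above the class string is "DZRZR": the literal reading accepts it because "DZR" matches, while reading (*) rejects it because the second U+200C follows an R.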
regards, Sebastian