UAX 31 Changes
L2/...
From: Mark Davis
Date: 2009-3-28
I suggest the following changes in UAX 31.
1. Fix ambiguous variables
There are suggested rules for using ZWJ and ZWNJ in http://unicode.org/draft/reports/tr31/tr31.html#Layout_and_Format_Control_Characters
In those rules, the variable $L stands for two different things: Left Joining, and Letter (for Indic). Although the two uses are in separate contexts, the rules would be much clearer without the overlap. There are a few possible alternatives; I suggest:
For the Joining specifications of ZWJ/ZWNJ, change $L, $R to $LJ, $RJ
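To illustrate how the renamed variables would read, here is a minimal Python sketch of the ZWNJ joining context. It assumes the rule has the shape $LJ $T* ZWNJ $T* $RJ, with $LJ meaning Dual_Joining or Left_Joining, $RJ meaning Dual_Joining or Right_Joining, and $T meaning Transparent. The JOINING_TYPE table and the function names are hypothetical: the table is a hand-picked sample of four Arabic characters, and real data would come from the Unicode Character Database.

```python
import re

ZWNJ = "\u200C"

# Hypothetical, hand-picked sample of Joining_Type values
# (D = Dual_Joining, R = Right_Joining, T = Transparent).
# A real implementation would load these from the Unicode Character Database.
JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH
    "\u0646": "D",  # ARABIC LETTER NOON
    "\u0627": "R",  # ARABIC LETTER ALEF
    "\u064E": "T",  # ARABIC FATHA
}


def char_class(types):
    """Build a regex character class from the sample table for the given types."""
    chars = [c for c, t in JOINING_TYPE.items() if t in types]
    return "[" + "".join(re.escape(c) for c in chars) + "]"


LJ = char_class({"D", "L"})  # $LJ: Dual_Joining or Left_Joining
RJ = char_class({"D", "R"})  # $RJ: Dual_Joining or Right_Joining
T = char_class({"T"})        # $T:  Transparent

# $LJ $T* ZWNJ $T* $RJ
ZWNJ_JOINING_CONTEXT = re.compile(f"{LJ}{T}*{ZWNJ}{T}*{RJ}")


def zwnj_breaks_cursive_connection(text):
    """Return True if text contains a ZWNJ in the joining context above."""
    return ZWNJ_JOINING_CONTEXT.search(text) is not None
```

With distinct names, a reader cannot confuse the joining class $LJ with the Indic $L (Letter) when both rule sets appear side by side.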
2. Add Default Ignorable Code Points to Table 4 Candidate Characters for Exclusion from Identifiers
In http://unicode.org/draft/reports/tr31/tr31.html#Specific_Character_Adjustments,
add a row:
[:Default_Ignorable_Code_Point=True:] Default Ignorable Code Points (See Section 2.3)
[Rationale: we already say that DIs should be excluded, with certain exceptions in Section 2.3, which has a lot of detail on the topic. This just makes that relationship more visible.]
3. Add Unicode 5.2 Characters to Table 3/4 (Candidates for Inclusion/Exclusion)
Add to Table 4 (Exclusion) the following scripts (this is a rough cut, so feedback is welcome):
Archaic / Historic
Old Turkic
Old South Arabian
Imperial Aramaic
Inscriptional Parthian
Inscriptional Pahlavi
Avestan
Egyptian Hieroglyphs
Javanese
Limited Use
Samaritan
Kaithi
Tai Viet
Bamum
Lisu
Add the following to Table 5. Recommended Scripts
Meetei Mayek
Tai Tham
4. Add Arabic Tatweel to Table 4
We have the following tables in http://unicode.org/draft/reports/tr31/tr31.html#Specific_Character_Adjustments
Table 3. Candidate Characters for Inclusion in Identifiers
Table 4. Candidate Characters for Exclusion from Identifiers
A. I suggest adding a row to Table 4:
[\u0640] Arabic Tatweel
B. Alternatively, one could break Table 4 into two tables:
Table 4a. Candidate Characters Identified by Code Point for Exclusion from Identifiers
Containing only Tatweel
Table 4b. Candidate Characters Identified by Property for Exclusion from Identifiers
Containing the current Table 4 contents
(Ken favors a two-table solution; I think it is simpler with one.)
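As a rough illustration of why Tatweel needs an explicit row, the following Python sketch shows that U+0640 has a letter general category, so a property-based identifier definition accepts it unless it is excluded by code point. The sketch uses Python's str.isidentifier(), which is built on XID_Start and XID_Continue, as an approximation of the default syntax; EXCLUDED_CODE_POINTS and is_profiled_identifier are hypothetical names standing in for a code-point-based exclusion row.

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL

# Hypothetical stand-in for a code-point-based exclusion row in Table 4.
EXCLUDED_CODE_POINTS = {TATWEEL}


def is_profiled_identifier(text):
    """Accept text only if Python's identifier syntax allows it and it
    contains none of the explicitly excluded code points."""
    return text.isidentifier() and not any(
        ch in EXCLUDED_CODE_POINTS for ch in text
    )


# General category of U+0640 (a modifier letter, hence identifier-eligible).
print(unicodedata.category(TATWEEL))

# The same Arabic word without and with a Tatweel inserted.
print(is_profiled_identifier("\u0633\u0644\u0627\u0645"))
print(is_profiled_identifier("\u0633\u0640\u0644\u0627\u0645"))
```

Either table layout (one table, or 4a/4b) leads to the same check; the difference is only in how the exclusion is documented.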
5. Add Characters from IDNA Tables Document
The IDNA tables document (draft) contains certain exceptions that we should review, in http://tools.ietf.org/html/draft-ietf-idnabis-tables#section-2.6.
The following characters are not in the Unicode identifier definition XID_Continue (after subtracting characters that are affected by case folding and NFKC), nor are they in the Candidates for Inclusion.
Greek And Coptic - Numeral signs
U+0375 ( ͵ ) GREEK LOWER NUMERAL SIGN
Arabic - Signs for Sindhi
U+06FD ( ۽ ) ARABIC SIGN SINDHI AMPERSAND
U+06FE ( ۾ ) ARABIC SIGN SINDHI POSTPOSITION MEN
Tibetan - Marks and signs
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
Katakana - Conjunction and length marks
U+30FB ( ・ ) KATAKANA MIDDLE DOT
Of these, I recommend that we add U+30FB ( ・ ) KATAKANA MIDDLE DOT to Table 3 (Candidate Characters for Inclusion in Identifiers), since it serves a function somewhat like an underbar. The others have gotten into the IDNA specification (draft), but there doesn't seem to be any compelling rationale for that. However, others may know more about them and be able to present good reasons for including them in UAX #31.
Note that the following is part of Pattern_Syntax, and thus not part of XID_Continue. Pattern_Syntax is immutable, and required to be disjoint from identifiers, and yet this character was added in that range, which was probably a mistake.
Supplemental Punctuation - Medievalist punctuation
U+2E2F ( ⸯ ) VERTICAL TILDE
Of the characters that Unicode has, and IDNA doesn't, I don't see any need to make any changes. Some of them are principled differences, like the omission of connector punctuation, and others are not, like the omission of Hangul Jamo.
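For anyone who wants to spot-check the characters discussed above, here is a small Python sketch. It approximates XID_Continue membership with str.isidentifier() by prefixing a known-valid start character; approx_xid_continue is a hypothetical helper name. The results depend on the Unicode version bundled with the Python interpreter, which will be later than the Unicode 5.1/5.2 data discussed here, so individual characters may be reported differently from what this document states.

```python
import unicodedata

# The characters discussed above, by code point.
CANDIDATES = ["\u0375", "\u06FD", "\u06FE", "\u0F0B", "\u30FB", "\u2E2F"]


def approx_xid_continue(ch):
    """Approximate XID_Continue membership for a single character.

    The leading "a" supplies a valid identifier start, so only the
    continue position is being tested.
    """
    return ("a" + ch).isidentifier()


for ch in CANDIDATES:
    name = unicodedata.name(ch, "<unnamed>")
    gc = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} {name} gc={gc} "
          f"xid_continue={approx_xid_continue(ch)}")
```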
5.1 Background
For completeness, the following lists the exceptions in version 05 of that document, organized by type.
*PVALID: // would otherwise have been DISALLOWED
00DF; PVALID # LATIN SMALL LETTER SHARP S
03C2; PVALID # GREEK SMALL LETTER FINAL SIGMA
06FD; PVALID # ARABIC SIGN SINDHI AMPERSAND
06FE; PVALID # ARABIC SIGN SINDHI POSTPOSITION MEN
0F0B; PVALID # TIBETAN MARK INTERSYLLABIC TSHEG
3007; PVALID # IDEOGRAPHIC NUMBER ZERO
*CONTEXTO: // would otherwise have been DISALLOWED
00B7; CONTEXTO # MIDDLE DOT
0375; CONTEXTO # GREEK LOWER NUMERAL SIGN (KERAIA)
05F3; CONTEXTO # HEBREW PUNCTUATION GERESH
05F4; CONTEXTO # HEBREW PUNCTUATION GERSHAYIM
30FB; CONTEXTO # KATAKANA MIDDLE DOT
*CONTEXTO: // would otherwise have been PVALID
002D; CONTEXTO # HYPHEN-MINUS
02B9; CONTEXTO # MODIFIER LETTER PRIME
0660; CONTEXTO # ARABIC-INDIC DIGIT ZERO
0661; CONTEXTO # ARABIC-INDIC DIGIT ONE
0662; CONTEXTO # ARABIC-INDIC DIGIT TWO
0663; CONTEXTO # ARABIC-INDIC DIGIT THREE
0664; CONTEXTO # ARABIC-INDIC DIGIT FOUR
0665; CONTEXTO # ARABIC-INDIC DIGIT FIVE
0666; CONTEXTO # ARABIC-INDIC DIGIT SIX
0667; CONTEXTO # ARABIC-INDIC DIGIT SEVEN
0668; CONTEXTO # ARABIC-INDIC DIGIT EIGHT
0669; CONTEXTO # ARABIC-INDIC DIGIT NINE
06F0; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ZERO
06F1; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ONE
06F2; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT TWO
06F3; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT THREE
06F4; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FOUR
06F5; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FIVE
06F6; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SIX
06F7; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SEVEN
06F8; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT EIGHT
06F9; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT NINE
0483; CONTEXTO # COMBINING CYRILLIC TITLO
3005; CONTEXTO # IDEOGRAPHIC ITERATION MARK
303B; CONTEXTO # VERTICAL IDEOGRAPHIC ITERATION MARK
*DISALLOWED: // would otherwise have been PVALID
302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK
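The listing above is regular enough to process mechanically. The following Python sketch groups such lines by status; it assumes the "code; STATUS # NAME" layout shown above, and parse_exceptions, LINE, and SAMPLE are hypothetical names (SAMPLE repeats four lines from the listing rather than the whole table).

```python
import re
from collections import defaultdict

# Four lines copied from the listing above, one per group.
SAMPLE = """\
00DF; PVALID # LATIN SMALL LETTER SHARP S
30FB; CONTEXTO # KATAKANA MIDDLE DOT
002D; CONTEXTO # HYPHEN-MINUS
302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
"""

# code point (hex) ; status # character name
LINE = re.compile(r"^([0-9A-F]{4,6});\s*(\w+)\s*#\s*(.+)$")


def parse_exceptions(text):
    """Group exception lines by status.

    Returns a dict mapping each status to a list of (code point, name)
    pairs, in the order the lines appear.
    """
    by_status = defaultdict(list)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip blank lines and "*STATUS: // ..." group headers
        code, status, name = match.groups()
        by_status[status].append((int(code, 16), name.strip()))
    return dict(by_status)


exceptions = parse_exceptions(SAMPLE)
for status, entries in exceptions.items():
    print(status, [f"U+{cp:04X}" for cp, _ in entries])
```

Note that the status alone does not record which default each exception overrides; that information is only in the group headers.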
5.2 Characters in IDNA draft
Here is the current set, as of the current draft and Unicode 5.1. You can paste into http://unicode.org/cldr/utility/list-unicodeset.jsp to explore, or compare against XID_Continue.
[\-0-9a-z·ß-öø-ÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıĵķĸĺļľłńņňŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżžƀƃƅƈƌƍƒƕƙ-ƛƞơƣƥƨƪƫƭưƴƶƹ-ƻƽ-ǃǎǐǒǔǖǘǚǜǝǟǡǣǥǧǩǫǭǯǰǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȡȣȥȧȩȫȭȯȱȳ-ȹȼȿɀɂɇɉɋɍɏ-ʯʹ-ˁˆ-ˑˬˮ̀-̿͂͆-͎͐-ͯͱͳ͵ͷͻ-ͽΐά-ώϗϙϛϝϟϡϣϥϧϩϫϭϯϳϸϻϼа-џѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁ҃-҇ҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊӌӎӏӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԛԝԟԡԣՙա-ֆ֑-ׇֽֿׁׂׅׄא-תװ-״ؐ-ؚء-ٞ٠-٩ٮ-ٴٹ-ۓە-ۜ۟-۪ۨ-ۿܐ-݊ݍ-ޱ߀-ߵߺँ-ह़-्ॐ-॔ॠ-ॣ०-९ॱॲॻ-ॿঁ-ঃঅ-ঌএঐও-নপ-রলশ-হ়-ৄেৈো-ৎৗৠ-ৣ০-ৱਁ-ਃਅ-ਊਏਐਓ-ਨਪ-ਰਲਵਸਹ਼ਾ-ੂੇੈੋ-੍ੑੜ੦-ੵઁ-ઃઅ-ઍએ-ઑઓ-નપ-રલળવ-હ઼-ૅે-ૉો-્ૐૠ-ૣ૦-૯ଁ-ଃଅ-ଌଏଐଓ-ନପ-ରଲଳଵ-ହ଼-ୄେୈୋ-୍ୖୗୟ-ୣ୦-୯ୱஂஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதந-பம-ஹா-ூெ-ைொ-்ௐௗ௦-௯ఁ-ఃఅ-ఌఎ-ఐఒ-నప-ళవ-హఽ-ౄె-ైొ-్ౕౖౘౙౠ-ౣ౦-౯ಂಃಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹ಼-ೄೆ-ೈೊ-್ೕೖೞೠ-ೣ೦-೯ംഃഅ-ഌഎ-ഐഒ-നപ-ഹഽ-ൄെ-ൈൊ-്ൗൠ-ൣ൦-൯ൺ-ൿංඃඅ-ඖක-නඳ-රලව-ෆ්ා-ුූෘ-ෟෲෳก-าิ-ฺเ-๎๐-๙ກຂຄງຈຊຍດ-ທນ-ຟມ-ຣລວສຫອ-າິ-ູົ-ຽເ-ໄໆ່-ໍ໐-໙ༀ་༘༙༠-༩༹༵༷༾-གང-ཇཉ-ཌཎ-དན-བམ-ཛཝ-ཨཪ-ཬཱིེུ-ྀྂ-྄྆-ྋྐ-ྒྔ-ྗྙ-ྜྞ-ྡྣ-ྦྨ-ྫྭ-ྸྺ-ྼ࿆က-၉ၐ-႙ა-ჺሀ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚ፟ᎀ-ᎏᎠ-Ᏼᐁ-ᙬᙯ-ᙶᚁ-ᚚᚠ-ᛪᜀ-ᜌᜎ-᜔ᜠ-᜴ᝀ-ᝓᝠ-ᝬᝮ-ᝰᝲᝳក-ឳា-៓ៗៜ៝០-៩᠐-᠙ᠠ-ᡷᢀ-ᢪᤀ-ᤜᤠ-ᤫᤰ-᤻᥆-ᥭᥰ-ᥴᦀ-ᦩᦰ-ᧉ᧐-᧙ᨀ-ᨛᬀ-ᭋ᭐-᭙᭫-᭳ᮀ-᮪ᮮ-᮹ᰀ-᰷᱀-᱉ᱍ-ᱽᴀ-ᴫᴯᴻᵎᵫ-ᵷᵹ-ᶚ᷀-᷿ᷦ᷾ḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉẋẍẏẑẓẕ-ẙẜẝẟạảấầẩẫậắằẳẵặẹẻẽếềểễệỉịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹỻỽỿ-ἇἐ-ἕἠ-ἧἰ-ἷὀ-ὅὐ-ὗὠ-ὧὰὲὴὶὸὺὼᾰᾱᾶῆῐ-ῒῖῗῠ-ῢῤ-ῧῶⅎↄⰰ-ⱞⱡⱥⱦⱨⱪⱬⱱⱳⱴⱶ-ⱻⲁⲃⲅⲇⲉⲋⲍⲏⲑⲓⲕⲗⲙⲛⲝⲟⲡⲣⲥⲧⲩⲫⲭⲯⲱⲳⲵⲷⲹⲻⲽⲿⳁⳃⳅⳇⳉⳋⳍⳏⳑⳓⳕⳗⳙⳛⳝⳟⳡⳣⳤⴀ-ⴥⴰ-ⵥⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⷠ-ⷿⸯ々-〇〪-〭〱-〵〻〼ぁ-ゖ゙゚ゝゞァ-ヾㄅ-ㄭㆠ-ㆷㇰ-ㇿ㐀-䶵一-鿃ꀀ-ꒌꔀ-ꘌꘐ-ꘫꙁꙃꙅꙇꙉꙋꙍꙏꙑꙓꙕꙗꙙꙛꙝꙟꙣꙥꙧꙩꙫꙭ-꙯꙼꙽ꙿꚁꚃꚅꚇꚉꚋꚍꚏꚑꚓꚕꚗꜗ-ꜟꜣꜥꜧꜩꜫꜭꜯ-ꜱꜳꜵꜷꜹꜻꜽꜿꝁꝃꝅꝇꝉꝋꝍꝏꝑꝓꝕꝗꝙꝛꝝꝟꝡꝣꝥꝧꝩꝫꝭꝯꝱ-ꝸꝺꝼꝿꞁꞃꞅꞇꞈꞌꟻ-ꠧꡀ-ꡳꢀ-꣄꣐-꣙꤀-꤭ꤰ-꥓ꨀ-ꨶꩀ-ꩍ꩐-꩙가-힣﨎﨏﨑﨓﨔﨟﨡﨣﨤﨧-﨩ﬞ︠-︦ﹳ𐀁-𐀋𐀍-𐀦𐀨-𐀺𐀼𐀽𐀿-𐁍𐁐-𐁝𐂀-𐃺𐇽𐊀-𐊜𐊠-𐋐𐌀-𐌞𐌰-𐍀𐍂-𐍉𐎀-𐎝𐎠-𐏃𐏈-𐏏𐐨-𐒝𐒠-𐒩𐠀-𐠅𐠈𐠊-𐠵𐠷𐠸𐠼𐠿𐤀-𐤕𐤠-𐤹𐨀-𐨃𐨅𐨆𐨌-𐨓𐨕-𐨗𐨙-𐨳𐨸-𐨿𐨺𒀀-𒍮𠀀-𪛖]