(Moved here unchanged from defunct http://www.mindspring.com/~markus.scherer/unicode/es4-unicode.html)
Markus W. Scherer
Since ECMAScript edition 3 was published, the Unicode standard has added features, fixed bugs, and improved specifications. This document is a proposal to update Unicode-related chapters in ECMAScript edition 4 accordingly.
Chapter references are to the current draft of edition 4 (draft from 2003-jan-13).
Chapter 6 requires that ECMAScript implementations use Unicode version 2.1 or higher.
Proposal: To require implementations to use Unicode version 3.0 or higher.
The higher the Unicode baseline version, the more characters can be used reliably in identifiers etc. However, while the current version of Unicode is 3.2, it is expected that some ECMAScript implementations are written in Java, where only Unicode 3.0 is supported at this time. Requiring a higher version of Unicode would put an additional burden on such implementations. Unicode 3.0 was published in 1999.
Chapter 6.1 specifies that implementations strip ignorable characters from the script text before any further parsing. Ignorable characters are defined as those with a Unicode general category value of Cf (format controls).
Proposal: To change the set of ignorables from Cf to Default_Ignorable_Code_Point, which is a superset of Cf.
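The Default_Ignorable_Code_Point property was introduced in Unicode 3.2. In addition to Cf characters, it contains further characters with similar control functions. It also covers ranges of unassigned code points that are reserved for future control characters. In particular, as the data listing below shows, in addition to Cf characters it lists non-whitespace control codes (Cc), surrogate code points (Cs), the Mongolian free variation selectors and VARIATION SELECTOR-1..VARIATION SELECTOR-16 (Mn), and the unassigned code points (Cn) in the ranges 2060..206F, FFF0..FFFB, and E0000..E0FFF.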
For ECMAScript implementations that use Unicode versions before 3.2, the Default_Ignorable_Code_Point property should be implemented as in Unicode 3.2:
    # Derived Property: Default_Ignorable_Code_Point
    #  Generated from <2060..206F, FFF0..FFFB, E0000..E0FFF>
    #    + Other_Default_Ignorable_Code_Point + (Cf + Cc + Cs - White_Space)

    0000..0008   ; Default_Ignorable_Code_Point # Cc  <control>..<control>
    000E..001F   ; Default_Ignorable_Code_Point # Cc  <control>..<control>
    007F..0084   ; Default_Ignorable_Code_Point # Cc  <control>..<control>
    0086..009F   ; Default_Ignorable_Code_Point # Cc  <control>..<control>
    06DD         ; Default_Ignorable_Code_Point # Cf  ARABIC END OF AYAH
    070F         ; Default_Ignorable_Code_Point # Cf  SYRIAC ABBREVIATION MARK
    180B..180D   ; Default_Ignorable_Code_Point # Mn  MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN FREE VARIATION SELECTOR THREE
    180E         ; Default_Ignorable_Code_Point # Cf  MONGOLIAN VOWEL SEPARATOR
    200C..200F   ; Default_Ignorable_Code_Point # Cf  ZERO WIDTH NON-JOINER..RIGHT-TO-LEFT MARK
    202A..202E   ; Default_Ignorable_Code_Point # Cf  LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
    2060..2063   ; Default_Ignorable_Code_Point # Cf  WORD JOINER..INVISIBLE SEPARATOR
    2064..2069   ; Default_Ignorable_Code_Point # Cn
    206A..206F   ; Default_Ignorable_Code_Point # Cf  INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
    D800..DFFF   ; Default_Ignorable_Code_Point # Cs
    FE00..FE0F   ; Default_Ignorable_Code_Point # Mn  VARIATION SELECTOR-1..VARIATION SELECTOR-16
    FEFF         ; Default_Ignorable_Code_Point # Cf  ZERO WIDTH NO-BREAK SPACE
    FFF0..FFF8   ; Default_Ignorable_Code_Point # Cn
    FFF9..FFFB   ; Default_Ignorable_Code_Point # Cf  INTERLINEAR ANNOTATION ANCHOR..INTERLINEAR ANNOTATION TERMINATOR
    1D173..1D17A ; Default_Ignorable_Code_Point # Cf  MUSICAL SYMBOL BEGIN BEAM..MUSICAL SYMBOL END PHRASE
    E0000        ; Default_Ignorable_Code_Point # Cn
    E0001        ; Default_Ignorable_Code_Point # Cf  LANGUAGE TAG
    E0002..E001F ; Default_Ignorable_Code_Point # Cn
    E0020..E007F ; Default_Ignorable_Code_Point # Cf  TAG SPACE..CANCEL TAG
    E0080..E0FFF ; Default_Ignorable_Code_Point # Cn

    # Total code points: 6271
Cn is the general category value for unassigned code points.
Note that the above list segments contiguous ranges by general categories as usual for lists of Unicode properties. This is purely for documentation.
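As a rough illustration of the proposal, the following sketch strips Default_Ignorable_Code_Point characters from script text. It is not taken from any ECMAScript draft: the names DEFAULT_IGNORABLE_RANGES, isDefaultIgnorable, and stripIgnorables are hypothetical, and the table simply merges the adjacent ranges of the listing above, since the segmentation by general category is only documentation.

```javascript
// Unicode 3.2 Default_Ignorable_Code_Point ranges from the listing above,
// with adjacent ranges merged. Each entry is [first, last], inclusive.
// The table name and layout are assumptions made for this sketch.
const DEFAULT_IGNORABLE_RANGES = [
  [0x0000, 0x0008], [0x000E, 0x001F], [0x007F, 0x0084], [0x0086, 0x009F],
  [0x06DD, 0x06DD], [0x070F, 0x070F], [0x180B, 0x180E], [0x200C, 0x200F],
  [0x202A, 0x202E], [0x2060, 0x206F], [0xD800, 0xDFFF], [0xFE00, 0xFE0F],
  [0xFEFF, 0xFEFF], [0xFFF0, 0xFFFB], [0x1D173, 0x1D17A], [0xE0000, 0xE0FFF],
];

// True if the code point falls into one of the ranges above.
function isDefaultIgnorable(codePoint) {
  return DEFAULT_IGNORABLE_RANGES.some(
    ([first, last]) => codePoint >= first && codePoint <= last
  );
}

// Remove ignorable code points before any further parsing.
// for...of iterates by code point, so a well-formed surrogate pair is seen
// as one supplementary code point; only unpaired surrogates match D800..DFFF.
function stripIgnorables(text) {
  let result = "";
  for (const ch of text) {
    if (!isDefaultIgnorable(ch.codePointAt(0))) {
      result += ch;
    }
  }
  return result;
}
```

Note that U+0085 is absent from the table because White_Space characters are excluded from the derived property, so the sketch leaves it in place.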
Chapter 7.3 defines a small number of characters as LineTerminator characters.
Proposal: To add the ISO Control U+0085 Next Line (NEL) to the list of LineTerminator characters.
Note that in edition 3 there was at least one other place where these characters were part of a production: Edition 3 chapter 9.3.1 ToNumber defines StrWhiteSpaceChar with what seems to be a union of WhiteSpace and LineBreak. (In edition 4, chapter 18.7.1 appears to be reserved for this.) I propose that such definitions be replaced with references to common definitions. This would make the standard more maintainable and avoid errors. Alternatively, each such definition should be updated in parallel.
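One way to picture the "common definition" idea is a single table that every consumer references. The sketch below assumes the edition 3 LineTerminator set (LF, CR, LS, PS) plus the proposed NEL; the names LINE_TERMINATORS, isLineTerminator, and indexOfLineTerminator are hypothetical and only serve the illustration.

```javascript
// One shared definition of LineTerminator: the edition 3 characters
// plus U+0085 (NEL) as proposed above.
const LINE_TERMINATORS = new Set([
  0x000A, // LINE FEED
  0x000D, // CARRIAGE RETURN
  0x0085, // NEXT LINE (proposed addition)
  0x2028, // LINE SEPARATOR
  0x2029, // PARAGRAPH SEPARATOR
]);

// All of these values are in the BMP, so testing 16-bit code units suffices.
function isLineTerminator(codeUnit) {
  return LINE_TERMINATORS.has(codeUnit);
}

// Hypothetical consumer that reuses the shared definition instead of
// repeating the character list: index of the first line terminator, or -1.
function indexOfLineTerminator(text) {
  for (let i = 0; i < text.length; i++) {
    if (isLineTerminator(text.charCodeAt(i))) {
      return i;
    }
  }
  return -1;
}
```

With this arrangement, adding NEL in one place updates every production that refers to the set.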
Chapter 7.5 defines identifier characters based on Unicode general categories, equivalent to the Unicode ID_Start and ID_Continue properties. ECMAScript also requires that at least the Unicode 3.0 ID_Start and ID_Continue characters are recognized. This is for stability: Very occasionally, Unicode changes general category assignments to fix problematic cases, but this can cause some characters to be suitable for identifiers in some earlier version of Unicode but not the current, later one.
Proposal: To include the new Unicode property ID_Start_Exceptions into the definition of ECMAScript identifiers.
The ID_Start_Exceptions property is being created so that one can implement stable identifier rules effectively. It lists all characters that were ID_Start in some Unicode version (2.1 or higher) but are not in the current Unicode version.
For ECMAScript implementations that use Unicode versions before 4.0, the ID_Start_Exceptions property should be implemented as in Unicode 4.0 (PropList.txt, showing pre-beta data here):
    2118 SCRIPT CAPITAL P
    212E ESTIMATED SYMBOL
    309B KATAKANA-HIRAGANA VOICED SOUND MARK
    309C KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
Note that the Unicode 4.0 character database is expected to be final around April of 2003, in time for inclusion of this property into ECMAScript edition 4.
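The stability rule can be sketched as a category-based test combined with the exception list. This is only an illustration under stated assumptions: isCategoryIdStart is a stand-in for "ID_Start in the implementation's current Unicode version" that checks only the letter and letter-number categories (it ignores $ and _, and it relies on the Unicode property escapes of present-day engines), and both function names are hypothetical.

```javascript
// The four pre-beta ID_Start_Exceptions code points listed above.
const ID_START_EXCEPTIONS = new Set([0x2118, 0x212E, 0x309B, 0x309C]);

// Stand-in for the category-based rule: general categories
// Lu, Ll, Lt, Lm, Lo (together \p{L}) and Nl.
function isCategoryIdStart(codePoint) {
  return /^[\p{L}\p{Nl}]$/u.test(String.fromCodePoint(codePoint));
}

// Stable rule: accept anything the current categories accept, plus every
// character that was ID_Start in some earlier Unicode version.
function isStableIdStart(codePoint) {
  return isCategoryIdStart(codePoint) || ID_START_EXCEPTIONS.has(codePoint);
}
```

For example, U+2118 SCRIPT CAPITAL P is a symbol rather than a letter in current data, so the category test alone rejects it, while the stable rule still accepts it.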
Chapter 7.8 defines HexEscape sequences (with alternatives \xhh and \uhhhh) for string literals. They are limited to code point values in the BMP (<=U+FFFF). A developer who needs to represent a supplementary code point has to calculate the two UTF-16 surrogate code units manually and put two \uhhhh sequences into the source code. This is tedious and error-prone. For example, U+20001 would have to be written as "\uD840\uDC01".
Proposal: To add escape sequences to ECMAScript that can directly represent supplementary Unicode code points (<=U+10FFFF).
Note that it is possible, but not necessary, to restrict such new escape sequences to only supplementary code points (10000..10FFFF but not <10000). ECMAScript permits \uhhhh for characters where \xhh would suffice, and other standards do not appear to have such restrictions.
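The manual calculation that such escape sequences would make unnecessary is the standard UTF-16 surrogate arithmetic. The following sketch, with the hypothetical helper name toUnicodeEscapes, produces the \uhhhh text a developer currently has to type for a given code point.

```javascript
// Return the \uhhhh escape text for a code point: one escape for the BMP,
// two (lead and trail surrogate) for supplementary code points.
function toUnicodeEscapes(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  const hex4 = (unit) => unit.toString(16).toUpperCase().padStart(4, "0");
  if (codePoint <= 0xFFFF) {
    return "\\u" + hex4(codePoint);
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const lead = 0xD800 + (offset >> 10);    // top 10 bits
  const trail = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return "\\u" + hex4(lead) + "\\u" + hex4(trail);
}

// U+20001: offset 0x10001, lead 0xD840, trail 0xDC01
console.log(toUnicodeEscapes(0x20001)); // prints \uD840\uDC01
```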
There are several reasonable syntaxes for such escape sequences:
Of course, more than one of these could be allowed.
(HTML and XML use variable-width syntaxes
Semantically, an escape sequence for a supplementary code point would add two "characters" (ECMAScript terminology; two invocations of codeToCharacter()) to the string or regular expression etc.
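The effect on the resulting string can be observed in present-day engines, whose String.fromCodePoint built-in postdates this proposal. The sketch below uses it only to show that one supplementary code point occupies two 16-bit "characters"; codeUnitsOf is a hypothetical helper name.

```javascript
// List the 16-bit code units ("characters" in ECMAScript terminology)
// that make up a string.
function codeUnitsOf(text) {
  return Array.from({ length: text.length }, (_, i) => text.charCodeAt(i));
}

// One supplementary code point between two BMP characters.
const sample = "a" + String.fromCodePoint(0x20001) + "b";

console.log(sample.length);             // 4 code units
console.log(Array.from(sample).length); // 3 code points
console.log(codeUnitsOf(sample).map((u) => u.toString(16)));
// [ '61', 'd840', 'dc01', '62' ]
```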
Similarly to the comment about LineBreak above, there are several ECMAScript edition 3 productions that define escape sequences. Edition 4 appears to already re-use a single definition.
The Unicode standard has modified and clarified some of its terminology. For example, "Unicode values" and "scalar values" should be changed to "Unicode code points", the use of "surrogate" vs. "supplementary character" needs to be cleaned up, etc.
Proposal: To update Unicode-related text in the ECMAScript standard to reflect changes in Unicode terminology. This would best be done by editing the chapters with change bars for review.