WCode - What Unicode could have been

(Moved here unchanged from defunct http://www.mindspring.com/~markus.scherer/unicode/wcode.html)

Markus W. Scherer, 2001-Mar-18

This is another Gedankenexperiment with Unicode (March seems to be a good month for those), this time about what Unicode itself could have been without most of the compromises that were necessary to make it successful.

This is not a serious proposal; it is purely intended for discussion, study, and comparison.

Introduction

I call this "WCode" to give it a new initial 'W' for derived definitions, just because 'W' is "double U" in English...

WCode is derived from Unicode, with most compromises against its founding principles removed. The one compromise that it keeps is an encoding range larger than 64k, because that makes it easier to define WCode in a useful way (with most of the characters in Unicode). Also, although Unicode Ideographic Description Sequences provide a way to encode CJKV ideographs with a small set of sub-ideographic characters, it would take a lot of analysis to apply this to the full set of ideographs.

Structure

- WCode encodes characters with the same principles as Unicode.
- Each character is assigned a unique code point 0..0xfffff (exactly 20 bits). (Unicode: 0..0x10ffff, 20.1 bits)
- WCode code points are written with "W-" followed by exactly 5 hexadecimal digits. (Unicode: "U+" followed by 4/5/6 hex digits)
- There are 16 "planes" with 64k code points each. (Unicode: 17 planes)
- There are 1984 "surrogate" code points set aside for WTF-16 (see below). (Unicode: 2048 surrogates)
  - The surrogates are at the upper end of the first plane (plane 0) so that WTF-16 code units compare lexically like sequences of code points. (Unlike in UTF-16)
    - Leading surrogates: W-0fc00..W-0ffbf. (Unicode: U+d800..U+dbff)
    - Trailing surrogates: W-0f800..W-0fbff. (Unicode: U+dc00..U+dfff)
    - In WCode, leading surrogates have higher code point values than trailing surrogates, while in Unicode it is the reverse. This is purely for a more natural alignment of the ranges. In WCode, the smaller range for the 960 leading surrogates (Unicode: 1024 leading surrogates) offers a natural range of 64 code points at the end of plane 0, while both surrogate ranges start at "even" values.
- W-0ffc0..W-0ffff are 64 non-characters and should be used only internally to applications. W-0fff0..W-0ffff are reserved for WCode itself and used like Unicode U+fff0..U+ffff (see also the reverse BOM below); the other 48 non-characters can be used freely in applications. W-xfffe and W-xffff (x=0..f) are unassigned but not non-characters.
- (Unicode: non-characters are U+fff0..U+ffff, U+fdd0..U+fdef, U+xxfffe, U+xxffff with xx=00..10)
- Private-Use Areas: W-0d800..W-0f0ff (6400 in plane 0) and W-f0000..W-fffff (64k, plane 15).
- (Unicode: U+e000..U+f8ff (6400), U+f0000..U+ffffd (64k-2), U+100000..U+10fffd (64k-2))
- BOM: W-0f7ff (Unicode: U+feff)
- Since all non-characters W-0ffc0..W-0ffff should never be exchanged as characters, the byte sequence 0xff 0xf7 at the beginning of a WTF-16 stream can be interpreted as a reverse BOM to detect the byte order.
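The ranges above can be sketched in a few lines of Python. This is only an illustration, not part of the original proposal: the names `format_wcode`, `classify`, and `detect_byte_order`, the category labels, and the "big"/"little" return values are all made up for this example. The numeric ranges are taken from the list above.

```python
# Ranges taken from the Structure list; all names here are hypothetical.
LEAD_FIRST, LEAD_LAST = 0xFC00, 0xFFBF        # 960 leading surrogates
TRAIL_FIRST, TRAIL_LAST = 0xF800, 0xFBFF      # 1024 trailing surrogates
NONCHAR_FIRST, NONCHAR_LAST = 0xFFC0, 0xFFFF  # 64 non-characters
PUA_RANGES = ((0xD800, 0xF0FF), (0xF0000, 0xFFFFF))
MAX_WCODE = 0xFFFFF                           # exactly 20 bits


def format_wcode(cp):
    """Write a code point as "W-" followed by exactly 5 hex digits."""
    if not 0 <= cp <= MAX_WCODE:
        raise ValueError("not a WCode code point")
    return "W-%05x" % cp


def classify(cp):
    """Return a coarse category label for a WCode code point."""
    if not 0 <= cp <= MAX_WCODE:
        raise ValueError("not a WCode code point")
    if LEAD_FIRST <= cp <= LEAD_LAST:
        return "leading surrogate"
    if TRAIL_FIRST <= cp <= TRAIL_LAST:
        return "trailing surrogate"
    if NONCHAR_FIRST <= cp <= NONCHAR_LAST:
        return "non-character"
    if any(first <= cp <= last for first, last in PUA_RANGES):
        return "private use"
    return "other"


def detect_byte_order(data):
    """Guess the byte order of a WTF-16 byte stream from its first two bytes.

    0xf7 0xff is the BOM W-0f7ff read big-endian; 0xff 0xf7 is the
    reverse BOM, i.e. the same BOM written little-endian.
    """
    if data[:2] == b"\xf7\xff":
        return "big"
    if data[:2] == b"\xff\xf7":
        return "little"
    return None  # no BOM present; the caller has to decide


print(format_wcode(0xF7FF))              # W-0f7ff
print(classify(0xFFBF))                  # leading surrogate
print(detect_byte_order(b"\xff\xf7A\x00"))  # little
```

The reverse BOM works because 0xff 0xf7 read big-endian would be W-0fff7, which is one of the reserved non-characters and so cannot start a well-formed stream.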

Assigned Characters

- WCode contains US-ASCII and C1 controls as direct subsets, but not most of ISO 8859-1. (Unicode: contains all of them as direct subsets)
- All scripts are encoded in logical order. (Unicode: Thai and Lao are not encoded in logical order)
- WCode contains all Unicode characters except ones with a decomposition of any kind. Normalization on WCode only sorts combining characters in canonical order. (This removes some 13000(?) characters from the BMP. WCode is mostly Unicode NFKD.)
- Private-Use code points are moved from U+e000..U+f8ff to W-0d800..W-0f0ff, and non-decomposable characters U+f900..U+ffef are moved to W-0f100..W-0f7ef to make space for surrogates at the top of plane 0.
- U+feff corresponds to two WCode characters: W-0f6ff ZWNBSP and W-0f7ff BOM.
- Otherwise, WCode code points are the same as Unicode code points.
- WCode code points that parallel assigned but decomposable Unicode code points should remain unassigned (for now).
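As a rough illustration of the assignment rules above, the following Python sketch maps a single Unicode code point to its WCode counterpart. Several things are assumptions for the sake of the example: the function name `unicode_to_wcode` and its `as_bom` flag are invented here, "has a decomposition of any kind" is approximated by checking whether NFKD changes the character, and Python's `unicodedata` reflects whatever Unicode version ships with the interpreter rather than the 2001 repertoire. Thai/Lao reordering is a string-level operation and is not covered.

```python
import unicodedata


def unicode_to_wcode(cp, as_bom=False):
    """Map one Unicode code point to a WCode code point (hypothetical helper).

    Raises ValueError for code points that have no WCode equivalent.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if cp > 0xFFFFF:
        raise ValueError("plane 16 has no WCode equivalent")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("UTF-16 surrogate code points are not characters")
    if cp == 0xFEFF:
        # U+feff splits into two WCode characters.
        return 0xF7FF if as_bom else 0xF6FF
    ch = chr(cp)
    if unicodedata.normalize("NFKD", ch) != ch:
        # Assumption: "decomposable" means NFKD changes the character.
        raise ValueError("decomposable characters are not in WCode")
    if 0xE000 <= cp <= 0xFFEF:
        # Private use and the rest of the upper BMP move down by 0x800
        # to make room for surrogates at the top of plane 0.
        return cp - 0x800
    return cp


print(hex(unicode_to_wcode(0xE000)))               # 0xd800
print(hex(unicode_to_wcode(0xFEFF)))               # 0xf6ff
print(hex(unicode_to_wcode(0xFEFF, as_bom=True)))  # 0xf7ff
```

Text that has already been normalized to NFKD never hits the "decomposable" error, so in practice the check only guards against unnormalized input.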

Encoding Forms

- WTF-16 (preferred): Like UTF-16, with surrogate values as above.
- WTF-32 (fixed-width): Like UTF-32.
- WTF-8 (compatible with US-ASCII and C1 controls): Like UTF-8C1.
- SCSW (byte-based, compressed, stateful): Like SCSU.
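The text only says that WTF-16 is "like UTF-16" with different surrogate values, so the exact bit layout below is an assumption made by analogy with UTF-16: subtract 0x10000, put the upper bits in the leading surrogate (starting at W-0fc00) and the lower 10 bits in the trailing surrogate (starting at W-0f800). With that layout the 960 x 1024 surrogate pairs cover planes 1..15 exactly. The function names are hypothetical.

```python
LEAD_FIRST, LEAD_LAST = 0xFC00, 0xFFBF
TRAIL_FIRST, TRAIL_LAST = 0xF800, 0xFBFF


def wtf16_encode(code_points):
    """Encode WCode code points as a list of 16-bit WTF-16 code units."""
    units = []
    for cp in code_points:
        if not 0 <= cp <= 0xFFFFF:
            raise ValueError("not a WCode code point")
        if cp < 0x10000:
            if TRAIL_FIRST <= cp <= LEAD_LAST:
                raise ValueError("surrogate code points cannot be encoded")
            units.append(cp)
        else:
            offset = cp - 0x10000              # 0..0xeffff
            units.append(LEAD_FIRST + (offset >> 10))
            units.append(TRAIL_FIRST + (offset & 0x3FF))
    return units


def wtf16_decode(units):
    """Decode a list of WTF-16 code units back to WCode code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if LEAD_FIRST <= unit <= LEAD_LAST:
            if i + 1 >= len(units) or not TRAIL_FIRST <= units[i + 1] <= TRAIL_LAST:
                raise ValueError("leading surrogate without trailing surrogate")
            offset = ((unit - LEAD_FIRST) << 10) + (units[i + 1] - TRAIL_FIRST)
            code_points.append(0x10000 + offset)
            i += 2
        elif TRAIL_FIRST <= unit <= TRAIL_LAST:
            raise ValueError("unpaired trailing surrogate")
        else:
            code_points.append(unit)
            i += 1
    return code_points


# Code unit order matches code point order (non-characters aside).
samples = [[0x10400], [0x41], [0xFFFFF], [0xF7FF], [0x10000], [0xD800]]
assert sorted(samples) == sorted(samples, key=wtf16_encode)
print([hex(u) for u in wtf16_encode([0xFFFFF])])  # ['0xffbf', '0xfbff']
```

The sort check illustrates why the surrogates sit at the top of plane 0: every assigned plane-0 character is below W-0f800, so a leading surrogate always compares higher than any of them.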

Conversion from and to Unicode

Unicode text is converted to WCode by normalizing it (NFKD), reordering Thai/Lao, moving U+e000..U+ffef to W-0d800..W-0f7ef, and changing the UTF to a WTF. In addition, U+feff is converted either to W-0f6ff if it is used as a ZWNBSP or to W-0f7ff if it is used as a BOM.

Plane 16 cannot be encoded.

After normalization and reordering, UTF-16 text can be transformed easily to WTF-16 by transforming code units. (In fact, this is similar to the "fix-up" necessary for comparing UTF-16 strings in code point order, except that U+fff0..U+ffff become W-0fff0..W-0ffff, the leading and trailing surrogate ranges are reversed, and there are fewer leading surrogates.)
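The code unit transformation might look like the following Python sketch. It assumes the input is well-formed UTF-16 that has already been normalized and reordered, that WTF-16 surrogates carry the same bits as UTF-16 surrogates (the text does not spell this out), and that a U+feff in first position is a BOM while any other U+feff is a ZWNBSP. The function names are invented for this example.

```python
def utf16_unit_to_wtf16(unit):
    """Transform one UTF-16 code unit into the matching WTF-16 code unit."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit code unit")
    if unit < 0xD800:
        return unit                      # unchanged below the surrogates
    if unit <= 0xDBFF:                   # UTF-16 leading surrogate
        if unit >= 0xDBC0:
            raise ValueError("plane 16 cannot be encoded")
        return unit + 0x2400             # 0xd800..0xdbbf -> 0xfc00..0xffbf
    if unit <= 0xDFFF:                   # UTF-16 trailing surrogate
        return unit + 0x1C00             # 0xdc00..0xdfff -> 0xf800..0xfbff
    if unit <= 0xFFEF:
        return unit - 0x800              # 0xe000..0xffef -> 0xd800..0xf7ef
    return unit                          # 0xfff0..0xffff stay in place


def utf16_to_wtf16(units):
    """Transform a sequence of UTF-16 code units, treating a leading U+feff as BOM."""
    out = [utf16_unit_to_wtf16(u) for u in units]
    if out and units[0] == 0xFEFF:
        out[0] = 0xF7FF                  # BOM rather than ZWNBSP (assumption)
    return out


# BOM, "A", U+1F600 as UTF-16 code units
print([hex(u) for u in utf16_to_wtf16([0xFEFF, 0x41, 0xD83D, 0xDE00])])
# ['0xf7ff', '0x41', '0xfc3d', '0xfa00']
```

Each code unit is transformed independently of its neighbors, which is what makes this step cheap compared with the normalization and reordering that precede it.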

WCode text is converted to Unicode by reversing the above steps, without the need for normalization.

What is missing from Unicode?

- Unicode contains many precomposed and compatibility characters, which were included so that the adoption of Unicode would be easier.
- Concerns were mainly 1:1 character conversion and simple display methods. As Unicode matures, new characters are added with more respect to the founding principles of designing a text encoding that is optimized for text processing. WCode represents a "What could have been without regard to legacy data and implementations".
- As a consequence, WCode cannot contain much of ISO 8859-1, because so many of its characters are precomposed.
- In order to get a more natural range of code points, WCode has one plane fewer than Unicode. The missing plane is a Private-Use Area plane.

WTF-8

Single bytes: 0..0x9f

Lead bytes: 0xa0..0xbf

Trail bytes: 0xc0..0xff

Overlap: 0x10000..0x1039f should be encoded with 4 bytes, not 3

Illegal: The 4-byte-accessible value range 0x100000..0x10ffff

(See also UTF-8C1.)

Acknowledgements

I would like to thank Mark Davis for feedback.
