Half Unicode [norandom]

(Moved here unchanged from http://www.mindspring.com/~markus.scherer/unicode/half-unicode-norandom.html)

Markus Scherer, 2006-mar-04

This is another Gedankenexperiment, expanding on "What If Unicode Had Been Born Byte-Based" and "Quarter Unicode".

Observations:

1. In a multi-byte encoding it is necessary to have non-overlapping lead and trail bytes for efficient backward iteration and random access.
2. However, even in a variable-length encoding with multiple sequence lengths, lead and trail bytes need not be completely non-overlapping in the same way for all sequence lengths. Reasonably efficient access is still possible as long as the byte patterns are sufficiently distinct.
3. As a result, more byte values are available for lead and trail bytes, and therefore more two-byte and three-byte characters are possible.

Half of the Unicode Code Space

It is very unlikely that there will be more than 512k (512×1024) assigned or designated code points in Unicode, including Private Use Areas in planes 15 and 16 (which don't fit into Quarter Unicode). A byte-based encoding for 512k code points could be more efficient than UTF-8 for those code points, and would support slightly less than half of the 1088k standard Unicode code points.

Unfortunately, an encoding of Unicode could not simply encode U+0000..U+7FFFF because there are already characters assigned on plane 14. In addition, it is useful for string sorting (for constructing upper-boundary strings) to be able to use the highest code point U+10FFFF, and all of planes 15 and 16 may be used for Private Use characters.

A mapping should cover the necessary code point ranges:

Code Points        Intermediate Values
0000..4FFFF        0000..4FFFF
E0000..10FFFF      50000..7FFFF

The cut-off between the two ranges can be chosen arbitrarily between U+40000 and U+50000 because there are currently no plans for any code point assignments on planes 4 through 13.

The only inaccessible code points with any designation are the noncharacters on the excluded planes.
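As an illustration, the mapping between code points and intermediate values might be sketched in C as follows. The function names and the UH8_NO_MAPPING sentinel are hypothetical, and the sketch assumes that the upper range is mapped by a simple constant offset (0xE0000 - 0x50000 = 0x90000), which is what the range boundaries suggest. It does not treat surrogates or noncharacters specially.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel for code points on the excluded planes. */
#define UH8_NO_MAPPING 0xFFFFFFFFu

/* Assumed constant offset between E0000..10FFFF and 50000..7FFFF. */
#define UH8_UPPER_OFFSET 0x90000u

/* Map a code point to its intermediate value (0..7FFFF),
 * or return UH8_NO_MAPPING if the code point is not covered. */
uint32_t uh8_to_intermediate(uint32_t c) {
    if (c <= 0x4FFFF) {
        return c;                        /* 0000..4FFFF map to themselves */
    }
    if (c >= 0xE0000 && c <= 0x10FFFF) {
        return c - UH8_UPPER_OFFSET;     /* E0000..10FFFF -> 50000..7FFFF */
    }
    return UH8_NO_MAPPING;               /* 50000..DFFFF or above 10FFFF */
}

/* Inverse mapping; the caller must pass a value in 0..7FFFF. */
uint32_t uh8_from_intermediate(uint32_t v) {
    assert(v <= 0x7FFFF);
    return (v <= 0x4FFFF) ? v : v + UH8_UPPER_OFFSET;
}
```

The intermediate values fill exactly 0x80000 = 512×1024 slots, so every intermediate value fits in 19 bits. Because the mapping is monotonic, the relative order of all covered code points is preserved.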

UH8: Byte-Based Encoding of Half of Unicode

Definitions of symbols for byte value ranges:

Single-byte values             00..7F
Multi-byte values (M)          80..FF
Low multi-byte values (L)      80..BF
High multi-byte values (H)     C0..FF
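The byte ranges above can be expressed as a small classifier. This is only a sketch: the type and function names are hypothetical, and it covers nothing beyond the range definitions themselves (in particular, it says nothing about which ranges serve as lead or trail bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the byte value ranges defined above. */
typedef enum { UH8_SINGLE, UH8_LOW, UH8_HIGH } Uh8ByteClass;

/* Classify one byte value by its range. */
Uh8ByteClass uh8_byte_class(uint8_t b) {
    if (b <= 0x7F) {
        return UH8_SINGLE;                     /* 00..7F */
    }
    return (b <= 0xBF) ? UH8_LOW : UH8_HIGH;   /* 80..BF or C0..FF */
}

/* The multi-byte range is the union of the low and high ranges: 80..FF. */
int uh8_is_multi(uint8_t b) {
    return b >= 0x80;
}

/* Count how many of the 256 byte values fall into a given class. */
int uh8_class_size(Uh8ByteClass cls) {
    int n = 0;
    for (int b = 0; b <= 0xFF; ++b) {
        if (uh8_byte_class((uint8_t)b) == cls) {
            ++n;
        }
    }
    return n;
}

/* Sanity check: 128 single-byte values, 64 low and 64 high values. */
void uh8_self_check(void) {
    assert(uh8_class_size(UH8_SINGLE) == 128);
    assert(uh8_class_size(UH8_LOW) == 64);
    assert(uh8_class_size(UH8_HIGH) == 64);
}
```

The low and high ranges differ only in bit 6 of the byte, so a single comparison or mask is enough to tell them apart once the byte is known to be 80 or above.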

Both Low and High values are used for lead and trail bytes according to the following table. Each of the L and H ranges has 64 values, that is, each can carry 6 bits of information. The M range combines L and H for 128 values or 7 bits of information.

Encode intermediate values as follows:

Character Boundaries

Detecting character boundaries is slightly harder in UH8 than in UTF-8 because both L and H values are used as lead and trail bytes, and overlap with M values. Still, only a small number of bytes needs to be examined. See sample C code.

As an advantage compared with UTF-8, backward iteration can detect the sequence length from the last trail byte, exactly symmetrical to forward iteration.