Half Unicode [norandom]
(Moved here unchanged from http://www.mindspring.com/~markus.scherer/unicode/half-unicode-norandom.html)
Markus Scherer, 2006-mar-04
This is another Gedankenexperiment, expanding on "What If Unicode Had Been Born Byte-Based" and "Quarter Unicode".
Observation:
In a multi-byte encoding it is necessary to have non-overlapping lead and trail bytes for efficient backward iteration and random access.
However, even in a variable-length encoding with multiple sequence lengths, the lead and trail bytes do not actually have to be completely non-overlapping in the same way for all sequence lengths. Reasonably efficient access is still possible as long as the byte patterns are sufficiently distinct.
As a result, more byte values are available for lead and trail bytes, and therefore more two-byte and three-byte characters are possible.
Half of the Unicode Code Space
It is very unlikely that there will be more than 512k (512×1024) assigned or designated code points in Unicode, including Private Use Areas in planes 15 and 16 (which don't fit into Quarter Unicode). A byte-based encoding for 512k code points could be more efficient than UTF-8 for those code points, and would support slightly less than half of the 1088k standard Unicode code points.
Unfortunately, an encoding of Unicode could not simply encode U+0000..U+7FFFF because there are already characters assigned on plane 14. In addition, it is useful for string sorting (for constructing upper-boundary strings) to be able to use the highest code point U+10FFFF, and all of planes 15 and 16 may be used for Private Use characters.
A mapping should cover the necessary code point ranges:
Code Points        Intermediate Values
0000..4FFFF        0000..4FFFF
E0000..10FFFF      50000..7FFFF
The cut-off between the two ranges is arbitrary: it could fall anywhere between U+40000 and U+50000, because there are currently no plans for any code point assignments on planes 4 through 13.
The only inaccessible code points with any designation are the noncharacters on the excluded planes.
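The mapping between code points and intermediate values can be sketched in C as follows. This is only an illustration of the two ranges listed above: the function and macro names (uh_to_intermediate, uh_from_intermediate, UH_UNMAPPED and so on) are hypothetical and not taken from the original sample code, and the sketch passes everything in 0000..4FFFF through unchanged without treating surrogates specially.

```c
#include <assert.h>
#include <stdint.h>

/* All names below are hypothetical, chosen for this sketch only. */
#define UH_UNMAPPED     ((int32_t)-1)
#define UH_LOWER_LIMIT  0x4FFFF   /* last code point of the lower range */
#define UH_UPPER_START  0xE0000   /* first code point of the upper range (plane 14) */
#define UH_SPLIT        0x50000   /* first intermediate value of the upper range */
#define UH_OFFSET       (UH_UPPER_START - UH_SPLIT)   /* 0x90000 */

/* Map a code point to its intermediate value.
 * Returns UH_UNMAPPED for the excluded range 50000..DFFFF
 * and for anything outside 0..10FFFF. */
static int32_t uh_to_intermediate(int32_t c) {
    if (c < 0) {
        return UH_UNMAPPED;
    }
    if (c <= UH_LOWER_LIMIT) {
        return c;                     /* 0000..4FFFF maps to itself */
    }
    if (c >= UH_UPPER_START && c <= 0x10FFFF) {
        return c - UH_OFFSET;         /* E0000..10FFFF -> 50000..7FFFF */
    }
    return UH_UNMAPPED;
}

/* Inverse mapping; v must be a valid intermediate value 0..7FFFF. */
static int32_t uh_from_intermediate(int32_t v) {
    assert(v >= 0 && v <= 0x7FFFF);
    return v < UH_SPLIT ? v : v + UH_OFFSET;
}
```

With these definitions, U+10FFFF maps to the highest intermediate value 7FFFF, so the intermediate values fill exactly 0x80000 = 512k slots.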
UH8: Byte-Based Encoding of Half of Unicode
Definitions of symbols for byte value ranges:
S    Single-byte values 00..7F
M    Multi-byte values 80..FF
L    Low multi-byte values 80..BF
H    High multi-byte values C0..FF
Both Low and High values are used for lead and trail bytes according to the following table. Each of the L and H ranges has 64 values, that is, each can carry 6 bits of information. The M range combines L and H for 128 values or 7 bits of information.
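The byte classes can be tested with a few comparisons. The following C sketch only illustrates the S, L, H and M ranges defined above; the type and function names are hypothetical, and the index helpers merely show where the 6 and 7 bits come from. How UH8 assigns those bits to an intermediate value is determined by the encoding table, not by this sketch.

```c
#include <stdint.h>

/* Hypothetical names for the byte classes S, L and H. */
typedef enum { UH_S, UH_L, UH_H } UhByteClass;

/* Classify one byte: S = 00..7F, L = 80..BF, H = C0..FF. */
static UhByteClass uh_classify(uint8_t b) {
    if (b < 0x80) {
        return UH_S;
    }
    return b < 0xC0 ? UH_L : UH_H;
}

/* M = 80..FF is the union of L and H. */
static int uh_is_multi(uint8_t b) {
    return b >= 0x80;
}

/* Index of a byte within its own L or H range: 0..63, that is 6 bits. */
static unsigned uh_index6(uint8_t b) {
    return b & 0x3Fu;
}

/* Index of a byte within the M range: 0..127, that is 7 bits. */
static unsigned uh_index7(uint8_t b) {
    return b & 0x7Fu;
}
```

For example, BF is the last L value (index 63 within L), C0 is the first H value (index 0 within H), and FF has index 127 within M.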
Encode intermediate values as follows:
Character Boundaries
Detecting character boundaries is slightly harder in UH8 than in UTF-8 because both L and H values are used as lead and trail bytes, and overlap with M values. Still, only a small number of bytes needs to be examined. See sample C code.
As an advantage compared with UTF-8, backward iteration can detect the sequence length from the last trail byte, exactly symmetrical to forward iteration.