Quarter Unicode
(Moved here unchanged from defunct http://www.mindspring.com/~markus.scherer/unicode/quarter-unicode.html)
Markus Scherer, 2006-feb-05
This is another Gedankenexperiment, expanding on "What If Unicode Had Been Born Byte-Based".
Observation:
In a multi-byte encoding it is necessary to have non-overlapping lead and trail bytes for efficient backward iteration and random access.
However, even in a variable-length encoding with multiple sequence lengths, lead and trail bytes do not actually have to be non-overlapping in the same way for all sequence lengths. Reasonably efficient access is still possible, provided that the byte patterns are sufficiently distinct.
As a result, more byte values are available for lead and trail bytes, and therefore more two-byte characters are possible.
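For comparison, UTF-8 is a familiar example of the strictly non-overlapping design described in the first point: every trail byte has the bit pattern 10xxxxxx, so backward iteration only needs to skip trail bytes. The following is a minimal sketch of that idea, not part of UQ8; the helper names are hypothetical, and well-formed input is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A UTF-8 trail byte always has the bit pattern 10xxxxxx (80..BF). */
static int utf8_is_trail(uint8_t b) {
    return (b & 0xC0) == 0x80;
}

/*
 * Given a boundary index i (0 < i <= length), return the index where the
 * preceding character starts.
 * Hypothetical helper; assumes well-formed UTF-8.
 */
static size_t utf8_prev_boundary(const uint8_t *s, size_t i) {
    assert(i > 0);
    do {
        --i;  /* step back one byte, then keep going while on a trail byte */
    } while (i > 0 && utf8_is_trail(s[i]));
    return i;
}
```

Because no lead byte can be mistaken for a trail byte, the loop needs no context beyond the byte it is looking at. An encoding that relaxes this separation has to examine a few neighboring bytes instead.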
A Quarter of the Unicode Code Space
It is very unlikely that there will be more than 256k (256×1024) assigned or designated code points in Unicode, if Private Use Areas in planes 15 and 16 are ignored. A byte-based encoding for 256k code points could be more efficient than UTF-8 for those code points, and would support slightly less than a quarter of the 1088k standard Unicode code points.
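The sizes quoted above can be checked with a little arithmetic. The sketch below assumes only the standard layout of 17 planes with 64k code points each; all identifiers are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum {
    UNICODE_PLANES     = 17,                           /* planes 0..16 */
    PLANE_SIZE         = 0x10000,                      /* 64k code points per plane */
    UNICODE_CODE_SPACE = UNICODE_PLANES * PLANE_SIZE,  /* U+0000..U+10FFFF */
    QUARTER_SPACE      = 256 * 1024                    /* 256k = 2^18 code points */
};

/* Share of the code space covered by 256k code points, in parts per thousand. */
static unsigned quarter_share_permille(void) {
    return (unsigned)((uint64_t)QUARTER_SPACE * 1000 / UNICODE_CODE_SPACE);
}

/* Restates the figures from the text as checks. */
static void check_code_space_arithmetic(void) {
    assert(UNICODE_CODE_SPACE == 1088 * 1024);       /* "1088k" */
    assert(UNICODE_CODE_SPACE == 0x110000);
    assert(QUARTER_SPACE == (1 << 18));              /* fits in 18 bits */
    assert(QUARTER_SPACE * 4 < UNICODE_CODE_SPACE);  /* less than a quarter */
}
```

Since 256k is exactly 2^18, a code point from such a subset fits into 18 bits, that is, three groups of 6 bits.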
Unfortunately, an encoding of Unicode could not simply encode U+0000..U+3FFFF because there are already characters assigned on plane 14. In addition, it is useful for string sorting (for constructing upper-boundary strings) to be able to use the highest code point U+10FFFF.
A mapping should cover the necessary code point ranges:
The two exceptionally mapped ranges are chosen to be convenient for the byte-based encoding that follows.
UQ8: Byte-Based Encoding of a Quarter of Unicode
Definitions of symbols for byte value ranges:
S: Single-byte values 00..7F
L: Low values 80..BF
H: High values C0..FF
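The three symbols can be expressed as a simple byte classifier. This is a minimal sketch; the type and function names are hypothetical. The second helper only shows that an L or H byte has 64 possible positions within its range. It does not assume how UQ8 actually assigns those bits.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { UQ8_S, UQ8_L, UQ8_H } Uq8Class;  /* hypothetical names */

/* Classify a byte value into the S, L or H range defined above. */
static Uq8Class uq8_classify(uint8_t b) {
    if (b <= 0x7F) return UQ8_S;  /* 00..7F */
    if (b <= 0xBF) return UQ8_L;  /* 80..BF */
    return UQ8_H;                 /* C0..FF */
}

/* Position of an L or H byte within its own range: 0..63. */
static unsigned uq8_offset_in_range(uint8_t b) {
    assert(b >= 0x80);  /* S bytes are not part of a 64-value range */
    return b & 0x3F;
}
```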
Both Low and High values are used for lead and trail bytes according to the following table. Each of the L and H ranges has 64 values, that is, each can carry 6 bits of information.
Encode intermediate values as follows:
The Unicode code points U+E0xxx have exactly the lead byte FE. U+10Fxxx have exactly the lead byte FF.
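The statement about the two fixed lead bytes can be sketched as a lookup. The function name is hypothetical, and the code reads U+E0xxx as U+E0000..U+E0FFF and U+10Fxxx as U+10F000..U+10FFFF. It covers only these two ranges and makes no assumption about the lead bytes of other code points.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the fixed lead byte for a code point in one of the two special
 * ranges, or -1 if the code point lies outside both of them.
 */
static int uq8_fixed_lead_byte(uint32_t c) {
    assert(c <= 0x10FFFF);                 /* must be a Unicode code point */
    if ((c >> 12) == 0xE0)  return 0xFE;   /* U+E0000..U+E0FFF */
    if ((c >> 12) == 0x10F) return 0xFF;   /* U+10F000..U+10FFFF */
    return -1;
}
```

For example, U+E0001 yields FE and U+10FFFF yields FF, so the highest code point always starts with the highest byte value.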
Character Boundaries
Detecting character boundaries is slightly harder in UQ8 than in UTF-8 because both L and H values are used as lead and trail bytes. Still, only a small number of bytes needs to be examined. See sample C code.