Quarter Unicode

(Moved here unchanged from defunct http://www.mindspring.com/~markus.scherer/unicode/quarter-unicode.html)

Markus Scherer, 2006-feb-05

This is another Gedankenexperiment, expanding on "What If Unicode Had Been Born Byte-Based".

Observations:

1. In a multi-byte encoding it is necessary to have non-overlapping lead and trail bytes for efficient backward iteration and random access.
2. However, even in a variable-length encoding with multiple sequence lengths, it is not actually necessary for lead and trail bytes to be completely non-overlapping in the same way for all sequence lengths. Reasonably efficient access is still possible as long as the byte patterns are sufficiently distinct.
3. As a result, more byte values are available for lead and trail bytes, and therefore more two-byte characters are possible.

A Quarter of the Unicode Code Space

It is very unlikely that there will be more than 256k (256×1024) assigned or designated code points in Unicode, if Private Use Areas in planes 15 and 16 are ignored. A byte-based encoding for 256k code points could be more efficient than UTF-8 for those code points, and would support slightly less than a quarter of the 1088k standard Unicode code points.
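The sizes quoted above can be checked with a few lines of C. This is only an arithmetic sketch; the constant and function names are made up for illustration and do not come from the original page.

```c
#include <stdint.h>

/* Full Unicode code space: 17 planes of 64k code points each (1088k). */
static const uint32_t UNICODE_CODE_POINTS = 17u * 0x10000u;   /* 1,114,112 */

/* The "quarter" code space discussed here: 256k code points, which is 2^18. */
static const uint32_t QUARTER_CODE_POINTS = 256u * 1024u;     /* 262,144 */

/* Share of the full code space covered by 256k code points,
 * in parts per thousand, using integer arithmetic only. */
static uint32_t quarter_share_permille(void) {
    return QUARTER_CODE_POINTS * 1000u / UNICODE_CODE_POINTS;
}
```

The function returns 235, that is, about 23.5%, which matches "slightly less than a quarter". Note also that 256k code points need exactly 18 bits.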

Unfortunately, an encoding of Unicode could not simply encode U+0000..U+3FFFF because there are already characters assigned on plane 14. In addition, it is useful for string sorting (for constructing upper-boundary strings) to be able to use the highest code point U+10FFFF.

A mapping should cover the necessary code point ranges:

The two exceptionally mapped ranges are chosen for convenience for the following byte-based encoding.

UQ8: Byte-Based Encoding of a Quarter of Unicode

Definitions of symbols for byte value ranges:

Single-byte values: 00..7F

Low (L) values: 80..BF

High (H) values: C0..FF

Both Low and High values are used for lead and trail bytes according to the following table. Each of the L and H ranges has 64 values, that is, each can carry 6 bits of information.
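The byte ranges defined above can be sketched in C as follows. The enum and function names are hypothetical. Treating the 6 bits as the byte's offset within its range is an assumption made for illustration; the text only states that each range can carry 6 bits, not how UQ8 assigns them.

```c
#include <stdint.h>

/* Hypothetical names for the three byte classes; the text only gives the ranges. */
typedef enum { UQ8_SINGLE, UQ8_LOW, UQ8_HIGH } Uq8ByteClass;

/* Classify one byte by the range it falls into. */
static Uq8ByteClass uq8_classify(uint8_t b) {
    if (b <= 0x7F) return UQ8_SINGLE;  /* 00..7F: single-byte values */
    if (b <= 0xBF) return UQ8_LOW;     /* 80..BF: Low (L) values */
    return UQ8_HIGH;                   /* C0..FF: High (H) values */
}

/* Offset of an L or H byte within its 64-value range (0..63).
 * Assumption: this offset is the 6-bit payload; the actual bit
 * assignment is given by the encoding table, not by this sketch. */
static unsigned uq8_six_bits(uint8_t b) {
    return b & 0x3Fu;
}
```

For example, `uq8_classify(0xBF)` yields `UQ8_LOW` and `uq8_six_bits(0xBF)` yields 63, while `uq8_six_bits(0xC0)` yields 0 because C0 is the first value of the High range.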

Encode intermediate values as follows:

The Unicode code points U+E0xxx have exactly the lead byte FE, and the code points U+10Fxxx have exactly the lead byte FF.
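The two fixed lead bytes can be expressed as a small lookup. This sketch assumes that "U+E0xxx" means U+E0000..U+E0FFF and "U+10Fxxx" means U+10F000..U+10FFFF; the function name is hypothetical, and the trail bytes are deliberately left out because they depend on the encoding table.

```c
#include <stdint.h>

/* Return the fixed lead byte for the two exceptionally mapped ranges,
 * or 0 if the code point is not in either range.
 * Assumed ranges: U+E0000..U+E0FFF and U+10F000..U+10FFFF. */
static uint8_t uq8_special_lead(uint32_t c) {
    if (c >= 0xE0000u && c <= 0xE0FFFu) return 0xFE;
    if (c >= 0x10F000u && c <= 0x10FFFFu) return 0xFF;
    return 0;  /* not an exceptionally mapped code point */
}
```

With these assumptions, the highest code point U+10FFFF gets the lead byte FF, the highest possible byte value.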

Character Boundaries

Detecting character boundaries is slightly harder in UQ8 than in UTF-8 because both L and H values are used as lead and trail bytes. Still, only a small number of bytes needs to be examined. See sample C code.
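For comparison, here is a minimal sketch of backward boundary detection in UTF-8, where trail bytes (80..BF) never double as lead bytes. It is not the sample code referred to above and does not implement UQ8; the function names are hypothetical, and well-formed input is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* UTF-8 only: trail bytes are exactly 80..BF and are never lead bytes. */
static int utf8_is_trail(uint8_t b) {
    return (b & 0xC0) == 0x80;
}

/* Return the index of the first byte of the character that contains s[i].
 * Steps back over at most 3 trail bytes, since a UTF-8 sequence has at
 * most 4 bytes. Assumes well-formed UTF-8. */
static size_t utf8_char_start(const uint8_t *s, size_t i) {
    size_t limit = 3;
    while (i > 0 && limit > 0 && utf8_is_trail(s[i])) {
        --i;
        --limit;
    }
    return i;
}
```

In UTF-8 a single test per byte is enough. In UQ8 the same byte value can be a lead or a trail byte, so a boundary check has to look at the pattern of neighboring bytes rather than at one byte in isolation.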