What If Unicode Had Been Born Byte-Based

(Moved here from defunct http://www.mindspring.com/~markus.scherer/unicode/uni-byte-based.html)

Markus Scherer, 2005-oct-28

Unicode was not designed for bytes

The Unicode Standard was originally designed for processing strings of 16-bit units, while nearly all legacy character sets were based on byte streams (even if they needed pairs or triplets of bytes for many characters). Encoding text with larger-than-byte units simplifies processing a lot when the character set is very large, and even Unicode 1.0 had more than 28,000 characters.

For text data exchange, Unicode text was to be either simply byte-serialized (originally only in big-endian byte order), or some other transformation was to be used as necessary.

This created a lasting tension with existing text processing systems, which were byte-based. As a result, many other Unicode encoding forms and charsets were created, foremost among them UTF-8, and software often needs to convert between them even when Unicode is used everywhere.
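To make the conversion burden concrete, here is a small illustration (not part of the original article) that serializes one short string in three ways using Python's standard codecs: 16-bit units in big-endian order, the same units in little-endian order, and UTF-8. The sample string is an arbitrary choice.

```python
# Three characters: 'A' (U+0041), 'é' (U+00E9), '中' (U+4E2D).
text = "A\u00e9\u4e2d"

utf16_be = text.encode("utf-16-be")  # 16-bit units, big-endian serialization
utf16_le = text.encode("utf-16-le")  # same units, little-endian serialization
utf8 = text.encode("utf-8")          # byte-based encoding form

print(utf16_be.hex(" "))  # 00 41 00 e9 4e 2d
print(utf16_le.hex(" "))  # 41 00 e9 00 2d 4e
print(utf8.hex(" "))      # 41 c3 a9 e4 b8 ad
```

The same text yields three different byte sequences, so any software that exchanges it has to know, or detect, which serialization is in use.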

What if Unicode had been born byte-based?

This is a "What if?" Gedankenexperiment, not something that will happen. It is also based on knowledge about the design and use of Unicode that was not available when the basic design was done.

If Unicode had been born byte-based, there would have been much less pressure to create so many different charsets for it. Of course, processing multi-byte sequences would still have been less efficient than using larger units.

What might it have looked like?

For example:

or:

Some characteristics in comparison with Unicode and UTF-8:

    • U+0000..U+007F encoded as 00..7F like US-ASCII.

    • All other characters encoded with at most three bytes each. Fewer length options lead to more efficient processing.

    • A two-byte form for some characters would have been possible, as a trade-off between simpler processing (only 1- and 3-byte forms) and somewhat more compact Latin/Greek/Cyrillic strings.

    • Smaller range of code points than in modern Unicode, but sufficient for currently anticipated requirements.

    • Lead bytes adjacent to single-byte values to simplify the initial range checks.

    • Trail bytes at the top of the byte value range simplify validity checking as well.

    • These encoding schemes maintain the random-access-friendly non-overlap design of UTF-8.
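The characteristics listed above can be sketched in code. The byte ranges below are a hypothetical layout chosen only for illustration. They are an assumption, not the author's actual schemes: single bytes 00..7F, lead bytes 80..BF directly above them, trail bytes C0..FF at the top of the range, and a three-byte form only. The function names are likewise invented for this sketch.

```python
# Hypothetical layout (an assumption for illustration, not the article's tables):
#   00..7F  single byte, same as US-ASCII
#   80..BF  lead byte of a three-byte sequence (adjacent to the single bytes)
#   C0..FF  trail byte (top of the byte value range)
FIRST_LEAD = 0x80
FIRST_TRAIL = 0xC0
MAX_CODE_POINT = 0x7F + 64 * 64 * 64  # 0x4007F with this toy layout


def encode(cp: int) -> bytes:
    """Encode one code point as one byte or as lead + two trail bytes."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("code point out of range for this toy scheme")
    if cp < FIRST_LEAD:
        return bytes([cp])
    v = cp - 0x80  # 18-bit value split into three 6-bit groups
    return bytes([
        FIRST_LEAD + (v >> 12),
        FIRST_TRAIL + ((v >> 6) & 0x3F),
        FIRST_TRAIL + (v & 0x3F),
    ])


def decode_at(data: bytes, i: int):
    """Decode the sequence starting at index i; return (code point, next index)."""
    b = data[i]
    if b < FIRST_LEAD:  # one comparison separates single bytes from the rest
        return b, i + 1
    if b >= FIRST_TRAIL:
        raise ValueError("unexpected trail byte")
    if i + 2 >= len(data) or data[i + 1] < FIRST_TRAIL or data[i + 2] < FIRST_TRAIL:
        raise ValueError("truncated or malformed sequence")
    v = ((b - FIRST_LEAD) << 12) | ((data[i + 1] - FIRST_TRAIL) << 6) \
        | (data[i + 2] - FIRST_TRAIL)
    return v + 0x80, i + 3


def sequence_start(data: bytes, i: int) -> int:
    """Back up from an arbitrary index to the first byte of its sequence."""
    while i > 0 and data[i] >= FIRST_TRAIL:
        i -= 1
    return i


data = b"".join(encode(cp) for cp in (0x41, 0xE9, 0x4E2D))
print(data.hex(" "))
print(decode_at(data, sequence_start(data, 6)))
```

Because single, lead, and trail bytes occupy disjoint, contiguous ranges, each check is a single comparison, and a reader landing in the middle of a sequence can find its start by skipping back over trail bytes, much as with UTF-8. This toy layout reaches only U+4007F, so a real design would need to divide the ranges differently to cover a larger repertoire.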