What If Unicode Had Been Born Byte-Based
(Moved here from defunct http://www.mindspring.com/~markus.scherer/unicode/uni-byte-based.html)
Markus Scherer, 2005-oct-28
Unicode was not designed for bytes
The Unicode Standard was originally designed for processing strings of 16-bit units, while nearly all legacy character sets were based on byte streams (even if they needed pairs or triplets of bytes for many characters). Encoding text with larger-than-byte units simplifies processing a lot when the character set is very large, and even Unicode 1.0 had more than 28,000 characters.
For text data exchange, Unicode text was to be either simply byte-serialized (originally only in big-endian byte order), or some other transformation was to be used as necessary.
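As a small illustration of what "simply byte-serialized" means, the following Python sketch writes a sequence of 16-bit units out as bytes in either byte order. The helper name `serialize_utf16_units` is an assumption made for this example, not something from the Unicode Standard.

```python
import struct

def serialize_utf16_units(units, big_endian=True):
    """Serialize a sequence of 16-bit code units into a byte string.

    Hypothetical helper for illustration only.
    """
    # ">" selects big-endian, "<" little-endian; "H" is one unsigned 16-bit unit.
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)

units = [0x0041, 0x20AC]  # U+0041 LATIN CAPITAL LETTER A, U+20AC EURO SIGN
be_bytes = serialize_utf16_units(units)                    # 00 41 20 AC
le_bytes = serialize_utf16_units(units, big_endian=False)  # 41 00 AC 20

# Python's built-in codec produces the same big-endian serialization.
same_as_codec = be_bytes == "A\u20ac".encode("utf-16-be")
```

The same two characters take a different byte sequence in each byte order, which is one reason a byte-oriented exchange format has to state or signal the order it uses.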
This created a lasting tension with existing text processing systems, which were byte-based. As a result, many other Unicode encoding forms and charsets were created, foremost among them UTF-8, and software often needs to convert between them even if Unicode is used everywhere.
What if Unicode had been born byte-based?
This is a "What if?" Gedankenexperiment, not something that will happen. It is also based on knowledge about the design and use of Unicode that was not available when the basic design was done.
If Unicode had been born byte-based, then there would have been much less pressure to create so many different charsets for it. Of course, processing multi-byte sequences would still have been less efficient than using larger units.
What might it have looked like?
For example:
or:
Some characteristics in comparison with Unicode and UTF-8:
U+0000..U+007F encoded as 00..7F like US-ASCII.
All other characters encoded with at most three bytes each. Fewer length options lead to more efficient processing.
A two-byte form for some characters would have been possible, as a trade-off between simpler processing (only 1- and 3-byte forms) and somewhat more compact Latin/Greek/Cyrillic strings.
Smaller range of code points than in modern Unicode, but sufficient for currently anticipated requirements.
Lead bytes adjacent to single-byte values to simplify the initial range checks.
Placing trail bytes at the top of the byte value range simplifies validity checking as well.
These encoding schemes maintain the random-access-friendly non-overlap design of UTF-8.
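The non-overlap property mentioned in the last point can be illustrated with real UTF-8, where trail bytes (80..BF) never coincide with single bytes or lead bytes. The sketch below, which assumes well-formed input and uses hypothetical helper names, backs up from an arbitrary byte offset to the start of the character that contains it. A byte-based Unicode of the kind described here would allow the same technique, with a different trail byte test.

```python
def is_utf8_trail(b):
    """True if byte value b is a UTF-8 trail byte (binary 10xxxxxx)."""
    return (b & 0xC0) == 0x80

def char_start(data, i):
    """Return the offset of the first byte of the character containing data[i].

    Assumes data is well-formed UTF-8 and 0 <= i < len(data).
    """
    while i > 0 and is_utf8_trail(data[i]):
        i -= 1
    return i

# "a", U+00E9, U+20AC, "z" encode as 1, 2, 3, and 1 bytes:
# 61 | C3 A9 | E2 82 AC | 7A
data = "a\u00e9\u20acz".encode("utf-8")
starts = [char_start(data, i) for i in range(len(data))]
# starts is [0, 1, 1, 3, 3, 3, 6]
```

Because no byte value serves as both a lead byte and a trail byte, the loop never needs to look at more than the bytes of one character, and it needs no state from the beginning of the string.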