UTF-8C1: A Safe and Simple Unicode Encoding Form

(Moved here unchanged from defunct http://www.mindspring.com/~markus.scherer/unicode/utf-8c1.html)

Markus W. Scherer, 2000-Mar-12

This is a Gedankenexperiment for what could have been UTF-8 if it had been defined after the Unicode range was set to that of UTF-16, with code points up to only 0x10ffff and not up to UCS-4's 0x7fffffff. I realize that this comes about 7 years too late, but I like the elegance that is possible with a custom-fit encoding form. Be my guest.

The name indicates that this is an encoding form that uses 8-bit code units and is C1-control-code-safe.

Important: This is not an approved encoding for Unicode nor for ISO 10646. It is also not a proposal for a new UTF. It is only meant as a "what if". UTF-8C1 is not compatible with UTF-8.

Design goals:

- Byte-based, stateless encoding.
- Code points 0..0x9f encoded as single bytes with the same values so that control codes C0, DEL, C1 are safely encoded.
- Code point range closely matching the Unicode range, up to 0x10ffff.
- No or few code points with more than one possible encoding, which is a concern with UTF-8, especially for the control codes.
- Short sequences with 6 bits/trail byte.

Encoding:

Note that 928 code points from 0x10000..0x1039f can be encoded with either 3 or 4 bytes. They should be encoded with 4 bytes. (This is arbitrary - I like it better this way.) Encoding them the other way around results in irregular sequences.
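The byte layout can be read off the sample code in this document. As a rough sketch, the sequence length could be classified either from a code point or from the first byte of a sequence. The function names `utf8c1_length_for_code_point` and `utf8c1_length_for_lead_byte` are made up for illustration and are not part of the original text.

```c
/*
 * Layout as implied by the sample code in this document:
 *
 *   code points          bytes  first byte   trail bytes
 *   0x0000..0x009f       1      0x00..0x9f   (none)
 *   0x00a0..0x039f       2      0xa0..0xab   1 x 0xc0..0xff
 *   0x03a0..0xffff       3      0xac..0xbb   2 x 0xc0..0xff
 *   0x10000..0x10ffff    4      0xbc..0xbf   3 x 0xc0..0xff
 *
 * The 3-byte form can technically reach 0x1039f, but 0x10000 and up
 * are expected to use 4 bytes.
 */

/* Number of bytes used to encode code point c, or 0 if c is out of range. */
int utf8c1_length_for_code_point(unsigned long c) {
    if (c <= 0x9f) return 1;
    if (c <= 0x39f) return 2;
    if (c <= 0xffff) return 3;
    if (c <= 0x10fffful) return 4;
    return 0;
}

/* Sequence length announced by first byte b, or 0 if b is a trail byte. */
int utf8c1_length_for_lead_byte(unsigned char b) {
    if (b <= 0x9f) return 1;
    if (b <= 0xab) return 2;
    if (b <= 0xbb) return 3;
    if (b <= 0xbf) return 4;
    return 0;
}
```

For example, U+00e9 falls in the 2-byte range, while U+10000 is reported as 4 bytes even though a 3-byte form exists for it.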

Signature byte sequence: Like the other Unicode encodings, the byte sequence representing U+feff shall be used as a signature. It is 0xbb 0xed 0xdf.
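The signature bytes follow from the 3-byte form used in the sample code in this document: 0xfeff - 0x3a0 = 0xfb5f, whose top 4 bits (0xf) are added to 0xac and whose two 6-bit groups (0x2d and 0x1f) are each combined with 0xc0. A minimal sketch of a signature check built on that arithmetic follows. The function name `utf8c1_has_signature` is hypothetical.

```c
#include <stddef.h>

/* Returns 1 if buf starts with the UTF-8C1 form of U+feff, else 0. */
int utf8c1_has_signature(const unsigned char *buf, size_t len) {
    const unsigned int c = 0xfeff - 0x3a0;  /* 0xfb5f, offset into the 3-byte range */
    unsigned char sig[3];

    sig[0] = (unsigned char)(0xac + (c >> 12));          /* 0xbb */
    sig[1] = (unsigned char)(0xc0 | ((c >> 6) & 0x3f));  /* 0xed */
    sig[2] = (unsigned char)(0xc0 | (c & 0x3f));         /* 0xdf */

    return len >= 3 && buf[0] == sig[0] && buf[1] == sig[1] && buf[2] == sig[2];
}
```

A caller would typically run this once on the first bytes of a stream and skip 3 bytes when it returns 1.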

Properties and comparison with UTF-8

- The code point range matches exactly that of Unicode and UTF-16. No range checks are necessary when reading UTF-8C1.
- This encoding does not have the "minimum-length problem" of UTF-8, except for 928 code points in a "harmless" area. It is more important that the ISO control codes are unambiguously encoded: being able to embed NUL or other control characters in multi-byte sequences, as UTF-8 allows, raises security concerns. This is why the 2-byte and 3-byte ranges are moved up by offsets of 0xa0 and 0x3a0, respectively. The alternatives would have been to allow ambiguous encodings for the control characters, or to move the 4-byte range up as well and so allow legal encodings for values up to 0x11039f, which would have made a range check necessary in safety-minded implementations.
- There are fewer 2-byte sequences. This is because, with the C1 control codes being reserved for single-byte codes, there are few bits available in lead bytes for a larger 2-byte code range while still encoding the BMP with up to 3 bytes and all of Unicode with up to 4.
- The 2-byte sequences still cover the Latin script range up to 0x39f. Compared with UTF-8, Greek, Hebrew, Arabic, and Cyrillic need 3-byte instead of 2-byte sequences.
- The 3- and 4-byte sequences end at the same code points as in UTF-8 (0xffff and 0x10ffff); only the 3-byte range starts lower, at 0x3a0.
- With UTF-EBCDIC, which uses the same number of byte values for single- and multi-byte-sequences, almost all of the CJK, Hangul, Yi, and compatibility areas of the BMP need 4 bytes per code point. UTF-8C1 needs only 3. Similarly, planes 4 to 16 need 5 bytes with UTF-EBCDIC and only 4 with UTF-8C1. This is because UTF-EBCDIC uses only 5 value bits in trail bytes.
- As with UTF-8, the binary sorting order of UTF-8C1 is the same as that of UTF-32.
- UTF-8C1 has the same kind of separation of code unit values into lead bytes, trail bytes, and single bytes as the other UTFs. This makes random access possible.
- There are no unused byte values. UTF-8 has two unused byte values (0xfe and 0xff); UTF-16, like UTF-8C1, has no unused code unit values.
- An EBCDIC-friendly encoding form could be constructed from this by using byte mapping tables like in UTF-EBCDIC. (This could be called "UTF-8C1-EBCDIC".)
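The random-access property in the list above can be illustrated with a short sketch. Trail bytes occupy 0xc0..0xff and every other byte value starts a sequence, so a reader that lands in the middle of a buffer only has to step back over trail bytes. The helper names `utf8c1_is_trail` and `utf8c1_sequence_start` are assumptions made for this example.

```c
#include <stddef.h>

/* A trail byte is any byte in 0xc0..0xff; all lower values start a sequence. */
int utf8c1_is_trail(unsigned char b) {
    return b >= 0xc0;
}

/*
 * Given an index i into s (i must be a valid index), step back to the
 * first byte of the sequence that contains s[i]. At most 3 bytes are
 * skipped, the number of trail bytes in a 4-byte sequence, so malformed
 * runs of trail bytes cannot cause an unbounded scan.
 */
size_t utf8c1_sequence_start(const unsigned char *s, size_t i) {
    int steps = 0;
    while (i > 0 && steps < 3 && utf8c1_is_trail(s[i])) {
        --i;
        ++steps;
    }
    return i;
}
```

On well-formed text the returned index always points at a single byte or a lead byte, from which a normal forward read can resume.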

Sample code

The following sample C code pieces do not check for valid code points, valid trail bytes, or array length overrun.

Writing UTF-8C1

    unsigned char *s;
    unsigned int c;
    int i;

    /* write code point c into array s starting at index i */
    if(c<=0x9f) {
        s[i++]=(unsigned char)c;              /* single byte */
    } else {
        if(c<=0x39f) {
            c-=0xa0;
            s[i++]=0xa0|(c>>6);               /* 2-byte lead 0xa0..0xab */
        } else {
            if(c<=0xffff) {
                c-=0x3a0;
                s[i++]=0xac+(c>>12);          /* 3-byte lead 0xac..0xbb */
            } else {
                c-=0x10000;
                s[i++]=0xbc|(c>>18);          /* 4-byte lead 0xbc..0xbf */
                s[i++]=0xc0|((c>>12)&0x3f);
            }
            s[i++]=0xc0|((c>>6)&0x3f);
        }
        s[i++]=0xc0|(c&0x3f);                 /* last trail byte */
    }

Reading UTF-8C1

    unsigned char *s;
    unsigned int c;
    int i;

    /* read from array s starting at index i into code point c */
    /* trail bytes are read with explicit indexes: several s[i++] in one
       expression would have no defined evaluation order in C */
    c=s[i++];
    if(c>=0xa0) {
        if(c<=0xab) {
            c=0xa0+(((c&0xf)<<6)|(s[i]&0x3f));
            i+=1;
        } else if(c<=0xbb) {
            c=0x3a0+(((c-0xac)<<12)|((s[i]&0x3f)<<6)|(s[i+1]&0x3f));
            i+=2;
        } else if(c<=0xbf) {
            c=0x10000+(((c&3)<<18)|((s[i]&0x3f)<<12)|((s[i+1]&0x3f)<<6)|(s[i+2]&0x3f));
            i+=3;
        } else {
            /* trail byte */
        }
    }
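The reading sample above skips all checks. A stricter variant might look like the following sketch, which adds a length check, verifies that every trail byte lies in 0xc0..0xff, and treats the 3-byte forms of 0x10000..0x1039f as errors. The function name `utf8c1_decode_checked`, its return convention, and the decision to reject irregular sequences (a lenient reader could accept them) are all assumptions made for this example.

```c
#include <stddef.h>

/*
 * Decode one code point from s[0..len). On success, store it in *out and
 * return the number of bytes consumed (1..4). Return 0 for empty or
 * truncated input, a trail byte in first position, a non-trail byte where
 * a trail byte is expected, or an irregular 3-byte form.
 */
size_t utf8c1_decode_checked(const unsigned char *s, size_t len, unsigned long *out) {
    unsigned long c;
    size_t n, k;

    if (len == 0) return 0;
    c = s[0];
    if (c <= 0x9f) { *out = c; return 1; }

    if (c <= 0xab)      { n = 2; c = c & 0xf; }
    else if (c <= 0xbb) { n = 3; c = c - 0xac; }
    else if (c <= 0xbf) { n = 4; c = c & 3; }
    else return 0;                       /* trail byte in first position */

    if (len < n) return 0;               /* sequence runs past the buffer */
    for (k = 1; k < n; ++k) {
        if (s[k] < 0xc0) return 0;       /* not a trail byte */
        c = (c << 6) | (s[k] & 0x3f);
    }

    /* add the offset of the range that this sequence length starts at */
    if (n == 2) c += 0xa0;
    else if (n == 3) c += 0x3a0;
    else c += 0x10000ul;

    if (n == 3 && c > 0xffff) return 0;  /* irregular: should use 4 bytes */
    *out = c;
    return n;
}
```

No upper range check is needed for the 4-byte case: two value bits in the lead byte plus 18 bits from the trail bytes cannot exceed 0xfffff before the 0x10000 offset is added.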