(Moved here unchanged from defunct http://www.mindspring.com/~markus.scherer/unicode/utf-16-alt.txt)
Alternative 16-bit Unicode encoding form(s)
another "Gedankenexperiment"
not a proposal!

Markus W. Scherer
2001-oct-13

This is an idea for a 16-bit form of Unicode that is sufficiently similar
to UTF-16 to be mostly compatible with UTF-16 software, but modifies it
to "fix" the main perceived shortcomings of UTF-16, which are:

- Surrogate code unit values are not the highest used in the UTF-16 encoding,
  which creates a binary order of UTF-16 units that is different from
  Unicode code point order.
- Surrogate code points cannot be encoded unambiguously in UTF-16;
  for example, in UTF-16, the sequence U+d800 U+dc00 is indistinguishable
  from U+10000.
- The Byte Order Mark U+feff can be confused with its other function
  as a Zero-Width No-Break Space.

A 16-bit encoding of Unicode is possible that improves on these shortcomings,
at the expense of compatibility with UTF-16 software.
For maximum possible similarity with UTF-16, such an encoding could encode
Unicode code points as follows:

code points     16-bit code units
----------------------------------------
0000..          0000..                  (same as UTF-16)
d7bf            d7bf

                d7c0 dc00..             are not used to encode Unicode code points
                d7f5 dfbf               (55232 sentinel pairs)

d7c0..          d7f5 dfc0..
d7ff            d7f5 dfff

d800..          d7f6 dc00..
ffef            d7ff dfef

                d7ff dff0..             are not used to encode Unicode code points
                d7ff dfff               (16 sentinel pairs)

fff0..          fff0..                  (16 BMP specials)
ffff            ffff

10000..         d800 dc00..             (same as UTF-16)
10ffff          dbff dfff

                e000..ffef              are not used to encode Unicode code points;
                                        they can be used as (8176) internal sentinels

Such an encoding would improve on the perceived shortcomings by

- using the surrogate code unit values as the highest ones
  for encoding Unicode code points
- unambiguously encoding surrogate code points as pairs of surrogate code units
- designating _both_ code units 0xfeff and 0xfffe as non-characters
  and encoding the Unicode code point U+feff with a surrogate pair

Drawbacks:

- Incompatibility with existing UTF-16 software
- Code points U+d7c0..U+ffef are encoded with 2 code units instead of 1.
  Note that
  U+d7c0..U+d7ff are currently unassigned,
  U+d800..U+dfff are surrogate non-characters,
  U+e000..U+f8ff are private-use characters,
  and of the 1776 code points U+f900..U+ffef there are
  + the ZWNBSP (U+feff) [not used as BOM!]
  + 12 "real" CJK ideographs
    (U+fa0e..U+fa0f, U+fa11, U+fa13..U+fa14, U+fa1f, U+fa21,
     U+fa23..U+fa24, U+fa27..U+fa29)
  + 7 further characters without decompositions
    (U+fb1e, U+fd3e..U+fd3f, U+fe20..U+fe23)
  + 32 non-character code points (U+fdd0..U+fdef)
  + all other code points are used for compatibility characters
- Software with insufficient range checking could create
  a non-shortest-form problem like in UTF-8

Sample code for reading Unicode code points and sentinel values
from such an encoding:

/* UChar is assumed to be an unsigned 16-bit code unit type, as in ICU. */
int32_t
getCodePointFromAlternate16BitForm(const UChar *s, int32_t &i, int32_t length) {
    int32_t c;
    UChar c2;

    if(i<0 || length<=i) {
        return -1;
    }

    c=s[i++];
    if(c<0xd7c0 || c>=0xfff0) {
        /* from most of the BMP */
        return c;
    } else if(c>=0xdc00 || i==length || (c2=s[i])<0xdc00 || c2>=0xe000) {
        /* sentinel 0x8000d7c0..0x8000ffef from a single 16-bit unit */
        return c|0x80000000;
    } else {
        ++i;
        /* same calculation as in UTF-16: (0xd7c0<<10)==(0xd800<<10)-0x10000 */
        c=(c<<10)+c2-((0xd7c0<<10L)+0xdc00);
        if(c>=0x10000 || (0xd7c0<=c && c<=0xffef)) {
            /* code points U+d7c0..U+ffef and U+10000..U+10ffff */
            return c;
        } else {
            /*
             * sentinel 0x80000000..0x8000d7bf,
             * 0x8000fff0..0x8000ffff from non-shortest form
             */
            return c|0x80000000;
        }
    }
}

------------------------------------------------------------------------

Another possibility:

In a 16-bit encoding that differs more from UTF-16, one could move the
surrogate values to the very end of the 16-bit range
(except for a small distance to 0xffff).
For example, such an encoding could use
1088 lead surrogate code units 0xf780..0xfbbf,
1024 trail surrogate code units 0xfbc0..0xffbf,
and encode Unicode code points as follows:

code points     16-bit code units
----------------------------------------
0000..          0000..                  (most of BMP, most of PUA)
f77f            f77f

                f780 fbc0..             are not used to encode Unicode code points
                f7bd ff3f               (63360 sentinel pairs)

f780..          f7bd ff40..
f7ff            f7bd ffbf

f800..          f7be fbc0..
ffef            f7bf ffaf

                f7bf ffb0..             are not used to encode Unicode code points
                f7bf ffbf               (16 sentinel pairs)

fff0..          fff0..                  (16 BMP specials)
ffff            ffff

10000..         f7c0 fbc0..             (supplementary characters)
10ffff          fbbf ffbf

                ffc0..ffef              are not used to encode Unicode code points;
                                        they can be used as (48) internal sentinels

Advantage compared to the previous:

- Almost all of the BMP Private-Use Area is encoded with single code units.

Disadvantage:

- Entirely different surrogate code unit values.

Note:

- 0xfeff could still be used as a BOM because it is a trail surrogate;
  text starting with a trail surrogate is otherwise not valid
  (other values 0xfffb..0xfffd and their mirrors could similarly be used
  for BOMs different from UTF-16 as distinction;
  0xfdff would be best because U+fdff is unassigned and
  U+fffd is the substitution character)

Other than that, advantages and disadvantages are the same as with the above.
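
Both mapping tables above follow the same arithmetic in the encoding
direction, which the following sketch tries to make explicit. It is not part
of the original sample code: the function name encodeAlternate16 and its
parameters pairStart and trailBase are hypothetical, chosen for illustration
only. It assumes that the first code point needing a pair is also the first
lead code unit value (0xd7c0 for the first form, 0xf780 for the second), and
that the caller supplies room for two code units.

```cpp
#include <cstdint>

/*
 * Hypothetical helper (an illustration, not the author's code):
 * encode one code point c in either alternative 16-bit form.
 *   first form:  pairStart=0xd7c0, trailBase=0xdc00
 *   second form: pairStart=0xf780, trailBase=0xfbc0
 * Writes 1 or 2 code units to dest and returns how many were written;
 * returns 0 if c is not in 0..0x10ffff.
 */
int32_t
encodeAlternate16(int32_t c, uint16_t pairStart, uint16_t trailBase,
                  uint16_t dest[2]) {
    if(c<0 || c>0x10ffff) {
        return 0;
    }
    if(c<pairStart || (0xfff0<=c && c<=0xffff)) {
        /* single unit: low part of the BMP and the 16 BMP specials */
        dest[0]=(uint16_t)c;
        return 1;
    }
    /* pair: top bits select the lead unit, low 10 bits the trail unit */
    dest[0]=(uint16_t)(pairStart+(c>>10));
    dest[1]=(uint16_t)(trailBase+(c&0x3ff));
    return 2;
}
```

With the first form's parameters, U+10000 comes out as d800 dc00, the same
as in UTF-16; with the second form's parameters, U+f780 comes out as
f7bd ff40. Because no offset is subtracted from c, the pairs whose value
would fall below pairStart are never produced, which is why they remain
available as sentinel pairs.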

Sample code for reading from this second encoding,
and for transforming UTF-16 text into it:

int32_t
getCodePointFromOther16BitForm(const UChar *s, int32_t &i, int32_t length) {
    int32_t c;
    UChar c2;

    if(i<0 || length<=i) {
        return -1;
    }

    c=s[i++];
    if(c<0xf780 || c>=0xfff0) {
        /* from most of the BMP */
        return c;
    } else if(c>=0xfbc0 || i==length || (c2=s[i])<0xfbc0 || 0xffc0<=c2) {
        /* sentinel 0x8000f780..0x8000ffef from a single 16-bit unit */
        return c|0x80000000;
    } else {
        ++i;
        c=(c<<10)+c2-((0xf780<<10L)+0xfbc0);
        if(c>=0x10000 || (0xf780<=c && c<=0xffef)) {
            /* code points U+f780..U+ffef and U+10000..U+10ffff */
            return c;
        } else {
            /*
             * sentinel 0x80000000..0x8000f77f,
             * 0x8000fff0..0x8000ffff from non-shortest form
             */
            return c|0x80000000;
        }
    }
}

int32_t
transformFromUTF16(UChar *dest, int32_t destCapacity,
                   const UChar *src, int32_t srcLength) {
    const UChar *limit;
    int32_t length;
    UChar c, c2;

    if(srcLength<-1) {
        return 0;
    } else if(srcLength==-1) {
        /* NUL-terminated input: no end pointer, stop at the first 0 */
        limit=NULL;
    } else /* srcLength>=0 */ {
        limit=src+srcLength;
    }

    length=0;
    while(src!=limit) {
        c=*src++;
        if(c==0 && limit==NULL) {
            break;
        }

        if(c<0xd800 || c>=0xfff0) {
            c2=0; /* shared single-unit BMP code point */
        } else if(c<=0xdbff && src!=limit && 0xdc00<=(c2=*src) && c2<=0xdfff) {
            ++src;
            c+=0x1fc0; /* shift up pairs for U+10000..U+10ffff */
            c2+=0x1fc0;
        } else if(c<0xf780) {
            c2=0; /* single-unit code points U+d800..U+f77f */
        } else /* c>=0xf780 */ {
            /* code points U+f780..U+ffef encoded with pairs */
            c2=0xfbc0+(c&0x3ff);
            c=0xf780+(c>>10);
        }

        if(length