UTF-8 Byte Sequences

(Moved here unchanged from defunct http://www.mindspring.com/~markus.scherer/unicode/utf-8-bytes.html)

Markus W. Scherer

2002-aug-10

UTF-8 is specified with a simple algorithm, but its many sequence lengths and its byte value restrictions result in a large number of illegal byte sequences. A conformant decoder must detect both malformed sequences and sequences that are well-formed but otherwise illegal.

A simple UTF-8 decoder function may return a pair of values (a boolean "is legal" flag and a 32-bit code point) while advancing an index into the input code units past the decoded sequence. It is possible to return a unique value for each illegal sequence, together with the "legal" flag set to false. With the error values suggested here, it is simple to subsume the "is legal" flag in the return value: a result is illegal if the return value is at or above 80000000 (or 110000 for Unicode), or if it is a surrogate D800..DFFF.
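The legality test described above can be sketched as a small Python function. This is only an illustration: the name is_legal and the unicode_only parameter are hypothetical and do not come from the original text, and the return value is assumed to be a non-negative integer that fits in 32 bits.

```python
def is_legal(value, unicode_only=True):
    """Return True if a decoder return value denotes a legal code point.

    Hypothetical helper: `value` is assumed to be the unsigned 32-bit
    return value of a decoder that follows the suggested error scheme.
    """
    # Unicode stops at 10FFFF; the original UTF-8 definition at 7FFFFFFF.
    limit = 0x110000 if unicode_only else 0x80000000
    if value >= limit:
        return False
    # Single surrogates are returned with their natural values but are illegal.
    return not (0xD800 <= value <= 0xDFFF)
```

A caller would then need only one return value: for example, is_legal(0x20AC) is true, while is_legal(0xD800) and is_legal(0x80000041) are both false.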

The following table lists well-formed legal and illegal sequences as well as malformed ones, with suggested, unique error return values for the illegal ones. It assumes that the decoder function always consumes at least one byte, and that after a lead byte it consumes as many trail bytes as the lead byte indicates, but stops consuming bytes just before the first non-trail byte that follows a lead byte. This suggested behavior helps with resynchronization after an illegal sequence.
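The consumption rule can be sketched in Python as follows. This is a minimal sketch, not the author's implementation: the function names are hypothetical, the lead-byte ranges assume the original definition of UTF-8 with sequences of up to 6 bytes, and C0 and C1 are assumed to count as 2-byte lead bytes.

```python
def sequence_length(lead):
    """Length announced by a lead byte (original 6-byte UTF-8 assumed)."""
    if lead < 0x80:
        return 1  # 00..7F: single byte
    if lead < 0xC0:
        return 1  # 80..BF: trail byte without a lead byte, consumed alone
    if lead < 0xE0:
        return 2  # C0..DF
    if lead < 0xF0:
        return 3  # E0..EF
    if lead < 0xF8:
        return 4  # F0..F7
    if lead < 0xFC:
        return 5  # F8..FB
    if lead < 0xFE:
        return 6  # FC..FD
    return 1      # FE, FF: never valid, consumed alone


def consumed_length(data, start):
    """Number of bytes a decoder following the rule consumes at `start`."""
    expected = sequence_length(data[start])
    end = start + 1  # always consume at least one byte
    # Take trail bytes (80..BF) until the announced length is reached,
    # stopping before the first non-trail byte or the end of the input.
    while end < start + expected and end < len(data) and 0x80 <= data[end] <= 0xBF:
        end += 1
    return end - start
```

For instance, consumed_length(b"\xE2\x82\xAC", 0) consumes all three bytes, while consumed_length(b"\xE2\x82A", 0) consumes only two, so the following "A" is decoded normally on the next call.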

Other possible error handling strategies would result in fewer or more illegal sequences and values. For example, a much simpler strategy is to treat each of the sequences listed as illegal below as a sequence of single-byte errors, with only 128 error return values but slow resynchronization. Another example is to synchronize as suggested below but to return only 6 different values like -1..-6 indicating the length of the illegal sequence.

The suggested error return values all have bit 31 set, except for single surrogate values (with a *) which are suggested to be returned with their natural values (with the "legal" flag set to false, of course).

The table further assumes the original definition of UTF-8. The Unicode Standard additionally forbids values 110000..7FFFFFFF. For a Unicode UTF-8 decoder function that follows the suggested scheme for best-effort resynchronization, the ratio of illegal sequences to legal ones is about 2000:1 (in decimal)! By comparison, for a similarly synchronizing UTF-16 decoder this ratio is almost the inverse, about 1:500 (in decimal).
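The order of magnitude of these ratios can be checked with a rough count. The sketch below is one possible way of counting and is not taken from the original text: it assumes that every lead byte followed by zero up to the indicated number of trail bytes counts as one distinct sequence, that lone trail bytes and the bytes FE and FF count as one sequence each, and that the only illegal UTF-16 sequences are the 2048 (decimal) unpaired surrogates.

```python
# Legal Unicode code points: 0..10FFFF without the surrogates D800..DFFF.
LEGAL = 0x110000 - 0x800


def count_utf8_sequences():
    """Count the distinct byte sequences a synchronizing decoder can consume."""
    # Number of lead byte values per sequence length (original 6-byte UTF-8).
    leads = {2: 32, 3: 16, 4: 8, 5: 4, 6: 2}
    total = 128      # single bytes 00..7F
    total += 64 + 2  # lone trail bytes 80..BF, plus FE and FF
    for length, lead_count in leads.items():
        # A lead byte followed by k trail bytes, k = 0 .. length - 1;
        # k < length - 1 means the sequence was cut short.
        for k in range(length):
            total += lead_count * 64 ** k
    return total


utf8_illegal = count_utf8_sequences() - LEGAL
utf8_ratio = utf8_illegal / LEGAL   # illegal sequences per legal one

utf16_illegal = 0x800               # unpaired lead or trail surrogates
utf16_ratio = LEGAL / utf16_illegal  # legal sequences per illegal one

print(f"UTF-8  illegal:legal is about {utf8_ratio:.0f}:1")
print(f"UTF-16 illegal:legal is about 1:{utf16_ratio:.0f}")
```

Under these assumptions the count gives a few thousand illegal UTF-8 sequences per legal one and several hundred legal UTF-16 sequences per illegal one.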

All values below are written in hexadecimal notation.