(Moved here unchanged from defunct http://www.mindspring.com/~markus.scherer/unicode/base16k.html)
Efficient Binary Data Encoding in Unicode Text
Markus Scherer, 2004-may-28
Binary data, that is, arbitrary data using unrestricted byte values, is frequently transported in text data formats where the syntax limits the available byte values. (Examples: Binary data in SMTP emails and in XML.) There are commonly used encodings for representing binary data in text; they map binary values to a small repertoire of characters that is available in almost all legacy codepages. Unicode presents an incentive and an opportunity for efficient encoding of binary data by using a larger repertoire of characters, resulting in a more efficient encoding than common methods.
The base64 encoding [RFC3548] segments binary data into 6-bit codes and represents each of them with one of 64 characters that are common to most legacy codepages including US-ASCII, namely lower- and uppercase latin letters and decimal digits, and a couple of punctuation characters. The repertoire is sometimes modified depending on the text format syntax.
This works well with legacy codepages where all of these characters are represented using single bytes. With 8-bit bytes and 6 bits per byte for the "payload", the efficiency is 75%.
Base64 is equally efficient in UTF-8, which is commonly used for Unicode text data exchange, but in many applications the base64 text will first expand into UTF-16, together with the rest of the document, before it is decoded into the original binary form. While in UTF-16 form, the effficiency drops to 6 bits per 16-bit code unit, or 37.5%. Using UTF-16 for the text data exchange as well, which is quite reasonable or even desirable for non-Latin scripts, yields this same low efficiency during transmission.
(A variant of base64 is actually part of the UTF-7 encoding for representing Unicode text for very restricted media.)
The large character repertoire of Unicode and ISO 10646 together with their near-universal use in modern software allow for very efficient encoding of binary data in Unicode. Such encoding can segment binary data into wider codes to use more of the code unit bits, using a large repertoire of character code points.
The character repertoire should have the following properties:
The repertoire should be as large as possible.
The number of characters should be a power of 2 to simplify processing.
The repertoire should ideally be represented by one single, contiguous range of Unicode code points, to avoid table lookups.
The code points should be on the BMP (below U+ffff) to use short character encodings.
The characters should be inert under most Unicode text transformations, especially normalization, but ideally also case mapping etc., so that such an encoding of binary data does not get corrupted by common processing.
Unassigned and private-use code points should be avoided because they are often restricted and could be affected by future or custom processing.
These properties point to using 16384 (=16*1024=16k) contiguous character code points out of the Unified Han repertoire. For the base16k encoding, the range U+5000..U+8fff is chosen to encode 14-bit codes from the segmented binary data. This yields an efficiency of 87.5% for the UTF-16 encoding form (for processing) as well as for the UTF-16, UTF-16BE/LE, SCSU and BOCU-1 charsets (for data exchange). When transformed into UTF-8, the efficiency is still 58.3% (14 bits per Han character encoded in 3 bytes), which is not as good as base64 in UTF-8, but much better than base64 in UTF-16.
As usual, the binary data bytes are encoded in memory order, and the most significant bit in the first byte is aligned with the most significant bit in the 14-bit code word, etc.
Encoding Length and Sequence Termination
By segmenting the binary data into 14-bit codes, a byte boundary in the binary data corresponds to a code/character boundary every 7 bytes, or every 4 codes/characters. The following table shows how many Han character code points are used in the encoding, with the numbers being multiples of N (a whole number).
Unlike in base64, where the number of characters per binary byte is unique (because the code segments are shorter than bytes), base16k cannot be unambiguously terminated with a character from outside the chosen repertoire. For example, 6 characters could encode either 9 or 10 bytes. It is possible to choose a secondary, distinct repertoire of 64 characters to carry the last up to 6 bits of the last binary byte if the number of bytes is not a multiple of 7. This is an unaesthetic complication.
A more elegant method to terminate the sequence precisely is to precede it with the number of binary bytes, which is anyway desirable to pre-allocate space in the decoder. For base16k, the encoding of the binary data with Han characters shall be preceded by the number of bytes expressed as a decimal number, using ASCII digits U+0030..U+0039 and no punctuation.
Further, to simplify decoding and transport, base16k shall be treated leniently:
Leading zeroes are allowed for the number of bytes.
Any non-digit (any character outside U+0030..U+0039) terminates the number of bytes.
After the number of bytes:
All characters except U+5000..U+8fff are ignored. This explicitly allows arbitrary spaces and line breaks.
Only the necessary number of Han characters in the range U+5000..U+8fff indicated by the table above are decoded into binary data.
Excess bits in the last decoded Han character but beyond the number of bits necessary are ignored.
It is possible to extend the encoding by adding further data fields, for example for a checksum. Such extensions are not essential to the encoding and thus left to higher-level protocols. The leniency of the decoder allows to append such data fields without disturbing the decoder.
Read Unicode characters or UTF-16 code units in memory order and process as follows. (Surrogates in UTF-16 can be ignored because supplementary code points are ignored.)
Read the number of bytes.
Start with an initial number of 0.
While there is a digit U+0030..U+0039, multiply the number value by 10 and add the digit value.
There must be at least one digit.
Any non-digit terminates the number. Proceed with step 2.
Read the binary data.
Stop if the number of bytes has reached 0. (It is decremented in one of the following steps.)
Signal an error if there are no more characters.
Ignore any characters outside of U+5000..U+8fff.
For a character in the range U+5000..U+8fff, subtract 0x5000 from the code point value to get a value 0..0x3fff.
Use the initial bits to complete one byte of binary data, together with remaining bits from the previous character if necessary.
Decrement the number of bytes. Stop if 0.
If there are 8 more bits available in the current value, then emit another byte of binary data and decrement the number of bytes.
Keep the remaining value bits for the next iteration.
binary in: (pairs of hex digits)
code points of the Han characters: (readonly)
Base16k with Legacy Codepages
In theory, base16k works with legacy codepages which contain all of the necessary characters. By design, this is true for GBK/windows-936 and GB 18030. Since they also encode Han characters with 2 bytes each, they would achieve the same efficiency of 87.5%. However, due to the variations of conversion tables and the size of such tables and the performance of table-based conversion, the use of base16k with legacy codepages is not advisable.
2004-05-25: First formal write-up.
2004-05-10: Refined to current base16k.
2004-05-07: Initial idea for a base4k encoding.
[RFC3548] RFC 3548 - The Base16, Base32, and Base64 Data Encodings (http://www.faqs.org/rfcs/rfc3548.html)
Oren Ben-Kiki asking in 2001 whether a "base4096" exists: http://www.geocrawler.com/archives/3/12303/2001/5/0/5850798/
Rick Jelliffe arguing in 1998 that binaries in XML can use something like base64, or "you might want to invent your own Base4K encoding": http://lists.xml.org/archives/xml-dev/199806/msg00378.html
John Cowan replied, proposing base-256 (with U+f000..U+f0ff).