(Moved here unchanged from defunct http://www.mindspring.com/~markus.scherer/unicode/base16k.html)
Because the binary data is segmented into 14-bit codes, a byte boundary in the binary data coincides with a code/character boundary every 7 bytes, that is, every 4 codes/characters (56 bits). The following table shows how many Han characters are needed to encode a given number of bytes, where N is any non-negative whole number.
|Number of bytes||Number of characters|
|7N||4N|
|7N+1||4N+1|
|7N+2||4N+2|
|7N+3||4N+2|
|7N+4||4N+3|
|7N+5||4N+3|
|7N+6||4N+4|
Unlike in base64, where each number of characters corresponds to exactly one number of binary bytes (because the code segments are shorter than bytes), base16k cannot be unambiguously terminated with a character from outside the chosen repertoire. For example, 6 characters could encode either 9 or 10 bytes. It would be possible to choose a secondary, distinct repertoire of 64 characters to carry the final bits (up to 6) of the last binary byte when the number of bytes is not a multiple of 7, but this is an unaesthetic complication.
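The character counts and the ambiguity described above follow from rounding the bit count up to a whole number of 14-bit codes. A minimal sketch in Python; the function name `han_chars_needed` is illustrative and not part of the original write-up:

```python
def han_chars_needed(num_bytes: int) -> int:
    """Number of 14-bit Han characters needed to carry num_bytes bytes."""
    # Round 8 * num_bytes bits up to a whole number of 14-bit codes.
    return (num_bytes * 8 + 13) // 14


# One full cycle: 0..7 bytes map to 0, 1, 2, 2, 3, 3, 4, 4 characters.
cycle = [han_chars_needed(n) for n in range(8)]

# The ambiguity: 9 and 10 bytes both occupy 6 characters,
# so the character count alone cannot tell them apart.
ambiguous = han_chars_needed(9) == han_chars_needed(10) == 6
```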
A more elegant method to terminate the sequence precisely is to precede it with the number of binary bytes, which is desirable anyway because it lets the decoder pre-allocate space. For base16k, the encoding of the binary data with Han characters shall be preceded by the number of bytes expressed as a decimal number, using ASCII digits U+0030..U+0039 and no punctuation.
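An encoder for this format might look like the following sketch. It assumes that bits are packed most significant bit first, that each 14-bit code is offset by 0x5000 into the Han range U+5000..U+8fff, and that the unused low bits of the final character are zero. The name `base16k_encode` is illustrative.

```python
def base16k_encode(data: bytes) -> str:
    """Encode bytes as a decimal length prefix followed by Han characters."""
    out = [str(len(data))]      # byte count in ASCII digits, no punctuation
    acc = 0                     # bit accumulator
    nbits = 0                   # number of valid bits currently in acc
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 14:
            nbits -= 14
            # Emit the top 14 pending bits as one character.
            out.append(chr(0x5000 + ((acc >> nbits) & 0x3FFF)))
        acc &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        # Left-align the leftover bits in a final character (zero padding).
        out.append(chr(0x5000 + ((acc << (14 - nbits)) & 0x3FFF)))
    return "".join(out)


encoded = base16k_encode(b"\x01\x02")   # "2\u5040\u7000"
```

The first Han character is a non-digit, so it also terminates the length prefix and no separator is needed.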
Further, to simplify decoding and transport, base16k shall be treated leniently:
- Leading zeroes are allowed for the number of bytes.
- Any non-digit (any character outside U+0030..U+0039) terminates the number of bytes.
- After the number of bytes:
  - All characters except U+5000..U+8fff are ignored. This explicitly allows arbitrary spaces and line breaks.
  - Only the necessary number of Han characters in the range U+5000..U+8fff indicated by the table above are decoded into binary data.
  - Excess bits in the last decoded Han character, beyond the number of bits necessary, are ignored.
It is possible to extend the encoding by adding further data fields, for example for a checksum. Such extensions are not essential to the encoding and are thus left to higher-level protocols. The leniency of the decoder allows such data fields to be appended without disturbing the decoder.
Read Unicode characters or UTF-16 code units in memory order and process as follows. (Surrogates in UTF-16 can be ignored because supplementary code points are ignored.)
1. Read the number of bytes.
   - Start with an initial number of 0.
   - While there is a digit U+0030..U+0039, multiply the number value by 10 and add the digit value.
   - There must be at least one digit.
   - Any non-digit terminates the number. Proceed with step 2.
2. Read the binary data.
   - Stop if the number of bytes has reached 0. (It is decremented in one of the following steps.)
   - Signal an error if there are no more characters.
   - Ignore any characters outside of U+5000..U+8fff.
   - For a character in the range U+5000..U+8fff, subtract 0x5000 from the code point value to get a value 0..0x3fff.
   - Use the initial bits to complete one byte of binary data, together with remaining bits from the previous character if necessary.
   - Decrement the number of bytes. Stop if 0.
   - If there are 8 more bits available in the current value, then emit another byte of binary data and decrement the number of bytes.
   - Keep the remaining value bits for the next iteration.
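The decoding steps above can be sketched in Python as follows. The function name `base16k_decode` and the use of `ValueError` to signal an error are assumptions made for this illustration. A bit accumulator stands in for the "remaining bits from the previous character".

```python
def base16k_decode(text: str) -> bytes:
    """Leniently decode a base16k string (length prefix, then Han characters)."""
    # Step 1: read the decimal byte count; leading zeroes are allowed.
    pos = 0
    remaining = 0
    while pos < len(text) and "0" <= text[pos] <= "9":
        remaining = remaining * 10 + (ord(text[pos]) - 0x30)
        pos += 1
    if pos == 0:
        raise ValueError("there must be at least one digit")

    # Step 2: read the binary data.
    out = bytearray()
    acc = 0     # bits carried over from previous characters
    nbits = 0   # number of valid bits in acc
    while remaining > 0:
        if pos >= len(text):
            raise ValueError("no more characters, but bytes are still expected")
        cp = ord(text[pos])
        pos += 1
        if not 0x5000 <= cp <= 0x8FFF:
            continue                      # ignore spaces, line breaks, etc.
        acc = (acc << 14) | (cp - 0x5000)
        nbits += 14
        while nbits >= 8 and remaining > 0:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            remaining -= 1
        acc &= (1 << nbits) - 1           # keep the remaining bits
    # Excess bits and any trailing characters are ignored.
    return bytes(out)


# Leading zeroes, whitespace, and trailing characters are tolerated.
decoded = base16k_decode("002 \u5040\n\u7000\u5000 trailing")   # b"\x01\x02"
```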
In theory, base16k works with any legacy codepage that contains all of the necessary characters. By design, this is true for GBK/windows-936 and GB 18030. Since they also encode these Han characters with 2 bytes each, they would achieve the same efficiency of 87.5% (14 data bits per 16 encoded bits). However, because conversion tables vary between implementations, are large, and make conversion slower, the use of base16k with legacy codepages is not advisable.
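The 87.5% figure can be checked with a short sketch. It relies on Python's built-in `utf-16-be`, `gbk`, and `gb18030` codecs as stand-ins for the encodings named above; whether they match a particular platform's conversion tables is an assumption. The helper name `efficiency` is illustrative, and the length prefix is not counted.

```python
HAN_FIRST, HAN_LAST = 0x5000, 0x8FFF
PAYLOAD_BITS = 14  # data bits carried by each Han character


def efficiency(encoding: str) -> float:
    """Payload bits divided by encoded bits, using the widest character in the range."""
    widest = max(
        len(chr(cp).encode(encoding)) for cp in range(HAN_FIRST, HAN_LAST + 1)
    )
    return PAYLOAD_BITS / (8 * widest)


for name in ("utf-16-be", "gbk", "gb18030"):
    print(name, efficiency(name))
```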
- 2004-05-25: First formal write-up.
- 2004-05-10: Refined to current base16k.
- 2004-05-07: Initial idea for a base4k encoding.
[RFC3548] RFC 3548 - The Base16, Base32, and Base64 Data Encodings (http://www.faqs.org/rfcs/rfc3548.html)
Oren Ben-Kiki asking in 2001 whether a "base4096" exists: http://www.geocrawler.com/archives/3/12303/2001/5/0/5850798/
Rick Jelliffe arguing in 1998 that binaries in XML can use something like base64, or "you might want to invent your own Base4K encoding": http://lists.xml.org/archives/xml-dev/199806/msg00378.html
John Cowan replied, proposing base-256 (with U+f000..U+f0ff).