Binary XML

(Moved here unchanged from defunct http://www.mindspring.com/~markus.scherer/unicode/binxml.html)

Markus Scherer 2002-dec-08

This is a concept paper.

Goals

A transformation of XML that represents all XML documents 1:1 but allows for faster, smaller generators and parsers. The binary XML format is intended for machine-to-machine data transfer. XML documents can be transformed 1:1 between the W3C text format and this binary format, preserving document structure and contents but not details of formatting. Essentially, the document structure is preserved in a more efficient form.

Compared with XML, this format adds native binary document values, so that binary data need not be expressed as hex or base64 or similar text.

Limitations

- Binary XML cannot be treated as text, edited with a text editor, parsed with a standard XML parser, etc.
- Binary XML can express DTD declarations and entity references, but it remains to be seen if this is sufficient for a fully validating parser.
- Binary XML can encode true binary data; standard XML does not provide standard syntax for this. Also, binary XML does by itself not restrict the character repertoire for any item, nor enforce any normalization.

High-Level Structure

A binary XML document consists of a well-formed, structured sequence of binary-encoded nodes that represent items of an XML document. Nodes are parsed directly as binary bytes, which avoids passing structure elements through a character converter. Text contents, including names of elements, are encoded in SCSU (which allows compact streams and fast parsing regardless of content language). A fixed, purely algorithmic charset allows for small parser/converter code without encoding ambiguities.

Each variable-length sequence is introduced with its length, eliminating escaping (NCRs) and scanning for terminators. White space is stored exactly as intended as part of the contents and never stripped. There is no white space outside of text contents. XML comments, named entity references and processing instructions can be encoded.

Elements and element attributes are encoded with an index, not a name string. Their names must be defined before they are used. Name definitions can be placed directly before their first use (which allows streaming generation and parsing) or at any valid point earlier in the stream (e.g. to make sure that frequently used names get small indexes). They are auto-indexed, with automatically incrementing index numbers, starting at 1.

A parser must provide the actual name strings to the application, not the index numbers, because different generators may assign the same names to different indexes. There is one combined index and list of names for both elements and their attributes. Names of entities and XML processing instructions and their attributes are not indexed but stored directly as strings.

Note that a trivial SCSU encoder can be implemented that is more compact for normal text than either UTF-8 or UTF-16: Encode Latin-1 in single-byte mode (SC0/UC0), C0 controls and U+FEFF with SQ0, and all non-Latin-1 UTF-16 code units in Unicode mode (two bytes per 16-bit unit; SCU, UQU). The inter-unit state in such an encoder is only a boolean flag for single byte mode vs. Unicode mode. A full SCSU decoder requires a lot more state but is straightforward and compact.

Structure:

document = xml_decl prolog element epilog

xml_decl = processing_instruction # <?xml ...?>

prolog = (processing_instruction | declaration | comment | name)*

epilog = (comment | processing_instruction)*

processing_instruction = node<proc, "instruction name"> text_data # <?name text...text?>

declaration = node<decl, "declaration string"> # declaration: <!...> unstructured string

comment = node<proc, 0> text_data

name = node<name, "element/attribute name">

element = node<elem, element index> (attribute | name)* element_data* node<elem, 0>

attribute = node<att, attribute index> (text_data | binary_data)

text_data = node<txt, "text">

binary_data = node<bin, "sequence of bytes">

entity = node<ent, "entity name">

Low-Level Structure

Each low-level node contains a type code and a number. Some nodes are followed by text or binary data, depending on the node type.

node = type_code (index_number | (length_number contents))

A node is encoded with one lead byte, 0..4 bytes for the number, and a sequence of data bytes depending on the node type. The lead byte contains the type code in bits 7..5. Bits 4..0 and the following 0..4 bytes encode the number. Numbers must be non-negative 32-bit values (0..7FFFFFFF).

Bits in a node (t = type code bits, n = number bits, c = contents bits): tttnnnnn nnnnnnnn{0..4}[cccccccc{number}]

Lengths of text data are numbers of SCSU bytes. A parser sets up an SCSU decoder, in initial state at the beginning of the stream. The SCSU decoder is never reset during parsing. Only text data bytes (from instruction names, name definitions, text data nodes, etc.) are passed consecutively to the decoder, using the Unicode string output from each piece as its parsed value. Each node's text data piece must contain valid SCSU sequences and must end with a complete encoding of a Unicode code point or other complete SCSU command.

Numbers are encoded as big-endian, byte-serialized integers with the fewest number of bytes. The lead byte contains either the number itself or a value indicating the length of the integer byte sequence. The table shows hexadecimal digits.

Example

Note that in hexadecimal notation the high nibble of the node lead byte appears to have a value of twice the type code because the type code occupies only the high-order bits of that nibble. For example, type code 5 for elements results in nibble value A (or B with numbers above 0F).

Issues

- XML namespaces are not handled particularly. Problem? Just always work with long names, with no name space declarations?
- Should binary XML streams have a checksum at the end? Simple 37*m+n hash? CRC?
- How likely may a bit error go undetected?
- Bit errors in text parts result in garbage text and may even destroy all subsequent text because the SCSU stream is not reset. However, if the parser has access to the SCSU decoder state (to make sure that each text part ends in the decoder's "read command" state) then it is likely that bit errors result in illegal states.
- Bit errors in encoded binary XML nodes will almost certainly cause parser errors.
- Standard XML documents do not carry any checksum.
- Similarly, is there a need for a prefix signature with a version number? For as long as XML does not change significantly as far as this format is concerned, there should be little to be changed. The initial node<proc, "xml"> (43 78 6D 6C) might be sufficient for now. A future revision could use a special processing attribute in this <?xml?> header equivalent or prepend another/different processing instruction or binary_data etc.
- There are many ways to encode this structure. One interesting modification may be to have separate name indexes (and separate type codes for name definitions) for elements and attributes (more names whose indexes fit into the node lead byte). For the maximum of 8 type codes as defined here, one other type code must then be dropped: Entities could be encoded as node<proc_att, "entity name"> and distinguished from processing instruction attributes by context, or they could be encoded as node<elem_name, 0> followed by text_data, etc.