Parsing ASN.1: A Troublesome Problem?


__/~\_/~\__
 »»»»»»»»»»»»»»»          (== | ^ ==)

Quicklinks


Abstract

ASN.1 (Abstract Syntax Notation One) is an international standard which aims at specifying format / structure of data used in telecommunication protocols. It was designed for modeling efficient communication between heterogeneous systems. Hence the description is also known as Abstract Syntax. This language is extremely simple (good news for language users) yet powerful to express complex constructs (an implied burden on compiler designers).

An ASN.1 compiler is required to read the source specification and generate appropriate data structures for in-memory representations and methods for encoding (from a value) / decoding (from an encoded byte stream). A major challenging aspect of building an ASN.1 compiler is the parsing of the input specification itself (as explained in Section 22.3 of the ASN.1 - Communication between heterogenous systems by Olivier Dubuisson).

The nature of the grammar of ASN.1 language is such that traditional LALR(1) bottom-up-parsers capabilities are insufficient. Although using an LL(1) seems propitious, it requires performing a long transformation leading to redundancy and plausibly a bloated grammar. Not to mention that ambiguities would still need to be handled. The elegance of a variant of the LR parser - Tomita Parser, on the other hand entailed only making trivial changes to the grammar yet providing a framework for handling of ambiguities ;).

In a quest to parse ASN.1 specification using a bottom-up-parser, we set out to understand the characteristics of the language that makes it less yielding for use in the Bison (GNU counterpart of Yacc) parser generator. Or is it? In the process, we will investigate a surprisingly simple and intuitive technique that can greatly simplify handling of any language with ambiguous constructs. The plausibility of a quicker solution conducive for incremental development led us to investigate the LR way. The prevalent misconception on unavailability of an LR based solution in public domain has been yet another driving factor for the investigation.

Introduction

A compiler interprets a program written in a ‘source language’, to translate it into the ‘target language’. The interpretation of the source language is commonly delegated to the Front-End of the compiler. Unfortunately, the grammar of ASN.1 notation has a structure (use of curly brackets for different concepts, no semicolon at the end of a definition...) which inherently proves quite complicated to deal with when it comes to programming ASN.1 parsers.

It is a common practice to employ parser generators in place of hand written parsers - the maintainability and scalability of parser being the main criteria for the same. Yacc (Yet Another Compiler Compiler) and its GNU counterpart Bison are the most famous and widely adopted bottom-up parser generators. When the raw grammar of the ASN.1:1997 standard is analyzed with Yacc, it issues 396 shift/reduce conflicts and 1,304 reduce/reduce conflicts.[1] (In the upcoming text, any reference to a parser indicates the use of code obtained by use of the Bison Parser Generator.)

Bison expects the grammar be specified in BNF format and that it complies with LALR(1). The Look Ahead Left Recursive (1) nature of bison mandates that at any stage during parsing, the algorithm can clearly decide on the appropriate operation (shift/reduce) with the knowledge of the upcoming token alone. As we will see, the ASN.1 language requires much more than one look-ahead token at times. Considering the varied number of look-aheads in different contexts, just a LALR(>1) is not a sufficient solution either. Quite often the conflicts can be alleviated by use of precedence specification or reorganization of the grammar. This holds particularly well when the grammar is unambiguous.

The term “Ambiguous” is used to indicate the possibility of interpreting a given input in more than one valid way in concordance with the grammar, i.e. an input with more than one parse trees. The ambiguities encountered such are usually lexical or structural ambiguities, which can be resolved contextually. This is analogous to words in English language the interpretation of which is subject to the context of use.

Even though Languages like C++ have such ambiguities in the raw grammar, certain properties of the language help in resolving the same – such as the mandate to declare an entity in C++ before its usage. Thus any ambiguity can be resolved immediately as the context of use is well defined (by the content parsed as yet). Unlike most languages, the ASN.1 language permits forward referencing, i.e. usage of / reference to an element (possibly in another specification file) even before it is defined. This language feature mandates that the entire input be parsed and the references resolved in a secondary phase.

Say “Hello” to Ambiguities!

The following e.g. illustrates a simple case of ambiguity encountered while parsing.

value TYPE ::= { iso 10 }                          -- Value Assignment
.
.
-- Possible Definition of TYPE #1
TYPE ::= OBJECT-IDENTIFIER                -- Type Assignment
.
.
-- Possible Definition of TYPE #2
TYPE ::= SEQUENCE { iso    INTEGER }   -- Type Assignment
.
.
-- Possible Definition of TYPE #3
TYPE ::= SET { iso    INTEGER }              -- Type Assignment
.
.
-- Class Definition whose Defined-Syntax resembles OBJECT IDENTIFIER types
CLASS-OID ::= CLASS
{
  &val1  INTEGER,
  &val2  INTEGER,
  &val3  INTEGER
}
WITH SYNTAX { &val1 &val2 &val3 }

iso INTEGER ::= 1
member-body INTEGER ::= 2
f INTEGER ::= 250

objectidentifier-like-object CLASS-OID ::= { iso member-body f }

Code Sample 1: Ambiguity I

Consider only the right-hand-side (RHS) of the Value-Assignment.  Without knowledge as to what TYPE is we can easily interpret, rather misinterpret ;) the RHS as an OBJECT IDENTIFIER, SEQUENCE or SET value. In a similar context, comma separated values (one or more in number) within curly braces can represents either of - BIT STRING, SEQUENCE OF, SET OF, Quadruple, Tuple and Defined-Syntax. The extent of the problem is further revealed when we consider the following example which is even more trivial.

value TYPE ::= { iso }                                                     -- Value Assignment
.
.
-- Possible Definition of TYPE #1
TYPE ::= OBJECT-IDENTIFIER                                       -- Type Assignment
.
.
-- Possible Definition of TYPE #2
TYPE ::= SEQUENCE OF INTEGER { iso(0), ccitt(1) }        -- Type Assignment
.
.
-- Possible Definition of TYPE #3
iso INTEGER ::= 0
ccitt INTEGER ::= 1
TYPE ::= SEQUENCE OF INTEGER                                   -- Type Assignment
.
.
-- Possible Definition of TYPE #4
TYPE ::= BIT STRING { iso(0), ccitt(1) }                           -- Type Assignment
.
.
-- Possible Definition of TYPE #5
TYPE ::= CLASS { &value    INTEGER } WITH SYNTAX { &value }              -- Class Assignment

Code Sample 2: Ambiguity II

We can broadly classify each of the above cases into two possible interpretations - namely as an object or value representation (with some ambiguity within each). Let us assume that TYPE has not yet been parsed (possibly imported from another module). Since TYPE is a valid Type-Reference and Object-Class-Reference, the assignment could possibly be a Value-Assignment or an Object-Assignment 8-|.

The number of ambiguities we can enumerate this way is plenty given the expressive power of the ASN.1 language. One of the significant hurdles is the Defined-Syntax which enables the user to specify the format to be used for the object of a class.

The Infamous “Mr.Defined Syntax”

One of the most challenging aspects of the ASN.1 language is the so called Defined-Syntax also known as User-Friendly Syntax or Variable-Syntax. The intent of this construct is to enable the user to write specification which is understandable for humans by enabling near English like syntax. The sentence is specified with placeholders that shall be filled with ASN.1 types / values / objects when constructing an object. More technically put, this sentence is a series of words (and comma punctuator) interspersed with references to fields of the class. The following e.g. shall illustrate the typical usage.

.
.
-- Class Definition
CALLCONTROLOBJECTCLASS ::= CLASS
{
  &ArgumentType  OPTIONAL,
  &argumentTypeOptional  BOOLEAN OPTIONAL,
  &objectClassIdentifier     OBJECT IDENTIFIER UNIQUE
}
WITH SYNTAX
{                                                             -- The Sentence defined by the user
  [ARGUMENT &ArgumentType
    [OPTIONAL &argumentTypeOptional]  ]
  IDENTIFIER &objectClassIdentifier
}
.
.
-- Object Definition
directCallAssociation CALLCONTROLOBJECTCLASS ::=
{
  ARGUMENT   SEQUENCE {  remotePEPId  ObjectReferenceId  }
  IDENTIFIER  { ccObjectClasses  4 }
}

ObjectReferenceId ::= INTEGER (-2147483648..2147483647) -- 4 octets

ccObjectClasses OBJECT IDENTIFIER ::=  { itu-t recommendation q 2981 cc-object-classes(6) }
.
.

Code Sample 3: Defined Syntax Illustration

Although Defined-Syntax is user friendly it is powerful enough to specify syntaxes (or sentences) which resemble the syntax of BuiltInTypes withing curly braces. These being -
   a. One or more white-space saperated values. (Without the context these can resemble an OBJECT IDENTIFIER value.)
   b. One or more comma separated values. (resembling SEQUENCE OF or SET OF value.)
   c. One or more comma separated values. (resembling SEQUENCE or SET value.)

The following e.g. gives a clearer picture of class definitions that can resemble values of BuiltinTypes -  

.
.
-- Class Definition whose Defined-Syntax resemble REAL (and potentially any SEQUENCE/SET) value
CLASS-REAL ::= CLASS
{
  &mantissa       INTEGER,
  &val-mantissa  INTEGER,
  &base             INTEGER,
  &val-base        INTEGER,
  &exponent       INTEGER,
  &val-exponent INTEGER
}
WITH SYNTAX
{                                                             -- The Sentence defined by the user
  &mantissa  &val-mantissa,
  &base        &val-base,
  &exponent  &val-exponent
}

mantissa INTEGER ::= 1
base INTEGER ::= 10
exponent INTEGER ::= 100

real-like-object CLASS-SEQSET ::= { mantissa 20, base 2, exponent  5 }
.
.
-- Class Definition whose Defined-Syntax resemble SEQUENCE/SET OF types
CLASS-SEQSETOF ::= CLASS
{
  &val1  INTEGER,
  &val2  INTEGER,
  &val3  INTEGER
}
WITH SYNTAX { &val1, &val2, &val3 }

seqof-like-object CLASS-SEQSETOF ::= { 10, 20, 30 }

The ambiguities mentioned here are not necessarily an exhaustive list and only serve as cues to help identify more of such kind. It can be observed that the frequency of occurrence of these ambiguous inputs in real world situations is quite less. The standard by itself deprecates (but does not prohibit :)) such complex usage of Defined-Syntax while the ambiguities arising from other constructs are a rarity given to good programming practices. Not surprisingly and sadly, many of the commercially available compilers do not handle them gracefully :(.

GLR tersely put

Since version 1.75, Bison began support for GLR (Generalized LR) parsing mechanism also known as Tomita's parser. Per this mechanism the parser splits when a shift/reduce or reduce/reduce conflict is encountered. As the parser proceeds in parallel, the path that leads to a parse error first dies. This way GLR parsing functions as an LR parser with variable number of lookaheads :);. (The temporal explosion of parser stack(s) is acceptable in comparison with the size of table required for an LR(n) parser.) But what if at a certain point both the stacks become identical 8-|? This situation is reported as an ambiguity since the parser does not know as to how to arbitrate them?

When an ambiguous construct is identified (i.e. multiple parses for an input is identified), bison looks for what is termed as dynamic precedence to resolve the contention. This is similar to how shift/reduce or reduce/reduce conflicts are resolved using %prec. The deferred semantic action is invoked on the contender, with a greater precedence.

Disambiguating the miscreants

Ambiguities arise as a result of lack of knowledge of the context. The most intuitive solution seems to be identifying the context before parsing such constructs. The next logical implication is the use of a multi-pass parser which would identify the context during one run and subsequently parse with the context information. Had types in ASN.1 been pure type definitions this would have been an easy option. But with types themselves holding values (in the form of DEFAULT values, constraints etc.) this is easier said than done given that the complex combinations are hard to enumerate.

The catch to using Bison generated GLR parser is as follows. In case of an ambiguity we give precedence to the contender who's AST (Abstract Syntax Tree) is easiest to convert to the others. The entire AST is built based on the above thumb-rule in a single pass. Before semantic analysis is performed an attempt is made to promote these ambiguous constructs to their correct interpretation. Thus disambiguation entails conversion of the AST of one of the contender (with highest precedence) to its correct form.

Consider Code Sample 1 for example. In this case of contention, OBJECT IDENTIFIER is given the highest dynamic precedence. Before semantic analysis is performed the RHS is promoted to the type of TYPE on the LHS - in case #2 into SEQUENCE's and in case #3 SET's AST. In case of Code Sample 2 with similar precedence the promotions are as follows - case #2/3 to SEQUENCE OF, case #4 to BIT STRING and in case #5 to an object of the class. The thumb-rule for associating dynamic precedence is - ASN.1 types whose AST is simple in representation and easier to convert to the rest gets higher precedence.

It may be noted that this method enables handling ambiguities as they are discovered; Incremental Development i.e. :). All that is required is to write semantic actions for building the parse tree (required even for unambiguous languages) and allotting dynamic precedence to potential contenders. With promotions performed before the semantic stage, the rest is same as it would have been for a language without ambiguities.

Bison also permits handling all contenders (using %merge) during an ambiguity! The prospective solution involves creation of all parse trees and making a choice between them later. This is conspicuously not a neat solution given to the amount of stray AST's that would be generated and destroyed.

Making sense of objects using Defined-Syntax

Ambiguities are just one part of the problem for ASN.1 parsers. With good familiarity of the language it is a no-brainer to realize that the Defined-Syntax usage maps uniquely to a Default-Syntax usage, which is more preferable for processing by subsequent phases of the compiler. Thus the front-end of the ASN.1 compiler is usually taxed with the task of translating it from a User-Friendly format to a compiler-friendly one (not necessarily the Default-Syntax). The reason for discussing this process in context of parsing is the fact that the disambiguation and the translation of Defined-Syntax are intertwined.

First we discuss on the resolution of ambiguities introduced by the Defined-Syntax. The Defined-Syntax has much less rigid a form compared to BuiltInTypes and is very likely to have a more complex AST. Sticking to our thumb-rule of giving higher precedence to a contender with a simpler AST, we give lower precedence to Defined-Syntax during contention. This entails performing promotions from certain BuiltInTypes (when ambiguous) to Defined-Syntax before semantic analysis begins.

Defined-Syntax involves describing the syntax for an object of the class. With careful analysis it maybe observed that the restrictions (Ref: ITU-T X.681 Clause 10.12) on the new syntax is equivalent to that commonly used in specifying command line syntaxes. The presence of Literals (WORD or comma) and PrimitiveFieldNames is mandated in such a manner that at any point either a Setting (value) corresponding to a definite FieldName or one amongst a set of distinct possible Literals is expected. The following code exemplifies violation of a restriction.

.
.
-- Class Definition
INVALIDCLASS ::= CLASS
{
  &val1  INTEGER  OPTIONAL,
  &val2  REAL        OPTIONAL
}
WITH SYNTAX
{                                                             -- The Sentence defined by the user
  OBJECT [V &val1] [V &val2]
}
.

.

object INVALIDCLASS ::= { OBJECT V 0 }

Restriction of such characteristics enables using a Deterministic Finite Automaton to validate and perform translation of the input without ambiguities. I.e. for any state in the DFA outgoing labels can have utmost one Setting and any number of Literals. Consider the following class definition and the DFA corresponding to the same.

.
.
-- Class Definition
VALIDCLASS ::= CLASS
{
  &TYPE OPTIONAL,
  &val1     REAL        OPTIONAL,
  &val2     &Type       OPTIONAL,
  &ValSet  INTEGER  OPTIONAL,
}
WITH SYNTAX
{                                                             -- The Sentence defined by the user
  [ [&Type] TOK-A [&val1] ] TOK-B [ [&val2] TOK-C [&ValSet] ]
}
.

.

two INTEGER ::= 2
object VALIDCLASS ::= { TOK-A 10 TOK-B two TOK-C { 10 | two } }

The DFA for the above class definitions is illustrated below. A run for each object of the class begins at State-A and accepts the input at State-I.

Although it is quite easy to formulate the DFA by hand, constructing it from the new syntax isn't obvious when optional groups are present. Optional groups essentially represent e-transitions which are not permitted in a DFA. It is a trivial task to construct a NFA from the new syntax (with only e-transitions but no two outward edges on the same input) and hence convert it into a DFA using the Powerset construction method. It must be noted that restrictions such as Clause 10.12.a ITU-T X.681 which are not implied by a FA are to be verified before construction of the NFA.

One of the advantages of using the FA method is that when the user goes amiss in constructing an object, we can precisely state what is expected just by virtue of the current state of the DFA. Secondly, non-compliance of the object with the syntax does not affect parsing the rest. Since any state in our DFA has utmost one outgoing transition labeled Setting, promoting it in case of ambiguity will be determined by the PrimitiveFieldName that corresponds to it.

Conclusions and Future Directions

There is a widely accepted notion that traditional LR is not a propitious means of parsing an ambiguous language as ASN.1. Hand written parsers and LL parsers are more preferred. With the newer generalized (parallel) parsing mechanism it is quite possible to parse ASN.1 using an LR(n) based parser with ease. The greatest incentive to doing this is that the grammar can be kept intact (as specified by the standards) without requiring much transformation (as required by LL grammar). With the GLR parsing mechanism supported by Bison, handling of ambiguities in ASN.1 can be performed very easily. The above solution proposed for handling of ambiguities and the Defined-Syntax requires a bit of infrastructure to be laid, but facilitates incremental development. This clearly demonstrates that no longer is the LR mechanism shaky in handling languages with complex constructs such as ASN.1.

One of the main advantages of the LL method seems to be its ability to recover from parse errors easily.[1] Bison's GLR parsing mechanism's error recovery has not been well tried upon during the above exercise. Per Bison's manual the behaviour of YYERROR which is used to initiate error recovery, is undefined when subject to deferred semantic actions. I.e. the error recovery mechanism is NOT deterministic when the parsing is non-deterministic ;).

As much as ambiguities can be handled by splitting the parser, there is plausibility that certain ambiguous constructs (aggregates) to be potentially of infinite length. When there is an ambiguity with such a construct, the parser stack(s) grow enormously. In such cases it is worthwhile considering if certain modifications to the grammar can thwart these situations. There are many cases where the number of look-aheads is fixed at a maximum (usually single digit) and hence limiting the potential growth withing acceptable norms. But there are certain recursive constructs where the bloat seems to be noticeable. It is worthwhile investigating into these and proposing suitable alterations to the grammar.

With GLR parsing mechanism, it is quite possible to implement LR parsers (such as the one generated by Bison) for ASN.1 and language of similar complexities, albeit it is unclear to the extent recovery from parse errors can be achived. The performance of the above method is of concern in certain situation and remains to be explored.

References


[1] As stated in Olivier Dubuissions Book