Articles‎ > ‎

Parsing ASN.1: A troublesome problem?

Abstract

ASN.1 (Abstract Syntax Notation One) is an international standard which aims at specifying format / structure of data used in telecommunication protocols. It was designed for modeling efficient communication between heterogeneous systems. Hence the description is also known as Abstract Syntax. This language is extremely simple (good news for language users) yet powerful to express complex constructs (an implied burden on compiler designers). A challenging aspect in writing an ASN.1 compiler is the parsing of the input specification itself. Here is a quote from in Section 22.3 of the ASN.1 - Communication between heterogenous systems (by Olivier Dubuisson) - 

If the raw grammar of the ASN.1:1997 standard is analyzed with Yacc (Yet Another Compiler Compiler), the most famous bottom-to-top parser generator (called LALR(1)), it issues 396 shift/reduce conflicts and 1,304 reduce/reduce conflicts. ANTLR (ANother Tool for Language Recognition), a top-down parser generator (called LL(k)), indicates more than 200 grammar productions beginning with the same lexical token (an opening curly braces for example!).

The nature and extent of the grammar of the ASN.1 language is such that traditional LALR(1) bottom-up-parsers capabilities are inadequate. But a variant of the LR parser - GLR / Tomita Parser, provides a mechanism for handling ambiguous grammars which can be tailored for parsing ASN.1 thoroughly. Most LR based parsers for ASN.1 don't quite support the Information Objects (esp. the Defined Syntax) in full or employ a multi-pass parser to achieve it. Here we explore a way to deal with the ambiguities involved in parsing ASN.1 using Bison (which implements the GLR parsing mechanism since v1.75).

At first a multi-pass parser which looks for types in one pass and values in a subsequent pass seems promising. Only until you realize that values can contain types and types can contain values, not to mention the possibilities brought in by the Defined Syntax.

To begin with we will take a look into ambiguities in the ASN.1 language.

Ambiguities

In parsing an ambiguity arises when a given input can be interpreted in multiple ways (i.e. more than 1 valid parse trees) in accordance with the grammar. The ambiguities are lexical or structural in nature which can be resolved contextually. But, the ASN.1 language supports unrestricted forward referencing whereby you ca make use of or reference an element (possibly in another specification file) even before it is defined.

The following e.g. illustrates use of some of the simplest constructs resulting in ambiguous inputs.

value TYPE ::= { iso 10 } -- Value Assignment
.
.
-- Possible Definition of TYPE #1
TYPE ::= OBJECT-IDENTIFIER -- Type Assignment
.
.
-- Possible Definition of TYPE #2
TYPE ::= SEQUENCE { iso INTEGER } -- Type Assignment
.
.
-- Possible Definition of TYPE #3
TYPE ::= SET { iso INTEGER } -- Type Assignment
.
.
-- Class Definition whose Defined-Syntax resembles OBJECT IDENTIFIER types
CLASS-OID ::= CLASS
{
  &val1 INTEGER,
  &val2 INTEGER,
  &val3 INTEGER
}
WITH SYNTAX { &val1 &val2 &val3 }

iso INTEGER ::= 1
member-body INTEGER ::= 2
f INTEGER ::= 250

objectidentifier-like-object CLASS-OID ::= { iso member-body f }

Code Sample 1 : Ambiguity I

Consider only the right-hand-side (RHS) of the Value-Assignment in Sample Code # 1. Without knowledge as to what TYPE is we can easily interpret, rather misinterpret the RHS as an OBJECT IDENTIFIER, SEQUENCE or SET value. And these are ambiguities just involve the common data-types.

In a similar context, comma separated values (one or more in number) within curly braces can represents either of - BIT STRING, SEQUENCE OF, SET OF, Quadruple, Tuple and Defined-Syntax. The extent of the problem is further revealed when we consider the following example which is even more trivial.

value TYPE ::= { iso } -- Value / Object Assignment?
.
-- Possible Definition of TYPE #1
TYPE ::= OBJECT-IDENTIFIER -- Type Assignment
.
-- Possible Definition of TYPE #2
TYPE ::= SEQUENCE OF INTEGER { iso(0), ccitt(1) } -- Type Assignment
.
-- Possible Definition of TYPE #3
iso INTEGER ::= 0
ccitt INTEGER ::= 1
TYPE ::= SEQUENCE OF INTEGER -- Type Assignment
.
-- Possible Definition of TYPE #4
TYPE ::= BIT STRING { iso(0), ccitt(1) } -- Type Assignment
.
.
-- Possible Definition of TYPE #5
TYPE ::= CLASS { &value INTEGER } WITH SYNTAX { &value } -- Class Assignment

Code Sample 2 : Ambiguity II


In the case of Sample Code # 2. each ambiguity arises from the possible interpretation of the RHS of the assignment as either a value or an object. The interpretation will depend on how TYPE  is defined later which is unknown while parsing the assignment.

The number of ambiguities we can enumerate this way is plenty given the expressive power of the ASN.1 language. One of the significant hurdles is the Defined-Syntax which enables th user to specify the format to be used for definition of object of a class.

The Defined Syntax

-- Class Definition
CALLCONTROLOBJECTCLASS ::= CLASS
{
&ArgumentType OPTIONAL,
&argumentTypeOptional BOOLEAN OPTIONAL,
&objectClassIdentifier OBJECT IDENTIFIER UNIQUE
}
WITH SYNTAX
{ -- The Sentence defined by the user
[ARGUMENT &ArgumentType
[OPTIONAL &argumentTypeOptional] ]
IDENTIFIER &objectClassIdentifier
}
...
-- Object Definition
directCallAssociation CALLCONTROLOBJECTCLASS ::=
{
ARGUMENT SEQUENCE { remotePEPId ObjectReferenceId }
IDENTIFIER { ccObjectClasses 4 }
}

ObjectReferenceId ::= INTEGER (-2147483648..2147483647) -- 4 octets

ccObjectClasses OBJECT IDENTIFIER ::=
{ itu-t recommendation q 2981 cc-object-classes(6) }

Code Sample 3: Defined Syntax Illustration

One of the most challenging aspects of the ASN.1 language is the so called Defined-Syntax also known as User-Friendly Syntax or Variable-Syntax. The intent of this construct is to enable the user to write specification which is understandable for humans by enabling near English like syntax. The sentence is specified with placeholders that shall be filled with ASN.1 types / values / objects when constructing an object. More technically put, this sentence is a series of words (and comma punctuator) interspersed with references to fields of the class.

The e.g. to the left illustrates the typical usage.

Although Defined-Syntax is user friendly it is powerful enough to specify syntaxes (or sentences) which resemble the syntax of BuiltInTypes withing curly braces. These being -

a. One or more white-space saperated values. (Without the context these can resemble an OBJECT IDENTIFIER value.)

b. One or more comma separated values. (resembling SEQUENCE OF or SET OF value.)

c. One or more comma separated values. (resembling SEQUENCE or SET value.)

The following e.g. gives a clearer picture of class definitions that can resemble values of BuiltinTypes -

-- Class Definition whose Defined-Syntax resemble REAL
-- (and potentially any SEQUENCE/SET) value
CLASS-REAL ::= CLASS
{
    &mantissa INTEGER,
    &val-mantissa INTEGER,
    &base INTEGER,
    &val-base INTEGER,
    &exponent INTEGER,
    &val-exponent INTEGER
}
WITH SYNTAX
{ -- The Sentence defined by the user
    &mantissa &val-mantissa,
    &base &val-base,
    &exponent &val-exponent
}

mantissa INTEGER ::= 1
base INTEGER ::= 10
exponent INTEGER ::= 100

real-like-object CLASS-SEQSET ::= { mantissa 20, base 2, exponent 5 }
.
.
-- Class Definition whose Defined-Syntax resemble SEQUENCE/SET OF types
CLASS-SEQSETOF ::= CLASS
{
    &val1 INTEGER,
    &val2 INTEGER,
    &val3 INTEGER
}
WITH SYNTAX { &val1, &val2, &val3 }

seqof-like-object CLASS-SEQSETOF ::= { 10, 20, 30 }


The ambiguities mentioned here are not necessarily an exhaustive list and only serve as cues to help identify more of such kind. It can be observed that the frequency of occurrence of these ambiguous inputs in real world situations is quite less.

The standard by itself deprecates (but does not prohibit) such complex usage of Defined-Syntax while the ambiguities arising from other constructs are a rarity given to good programming practices. Not surprisingly and sadly, many of the commercially available compilers do not handle them gracefully.

Disambiguation using GLR parser

Since version 1.75, Bison began support for the GLR (Generalized LR) parsing mechanism a.k.a. Tomita's parser. The GLR parser splits when a shift/reduce or reduce/reduce conflict is encountered. As the parser proceeds in parallel, the path that leads to a parse error first dies. This way GLR parsing functions as an LR parser with variable number of lookaheads. (The temporal explosion of parser stack(s) is acceptable in comparison with the size of table required for an LR(n) parser of fixed n.). A vast majority of conflicts in the grammar are thus handled without any work on the part of the programmer.

But what if at a certain point these alternate stacks converge?  Cases where there are two or more valid parse trees like with the examples shows. Which of the two parse trees should be acted upon? This situation is reported as an ambiguity since the parser does not know as to how to arbitrate them? When an ambiguous construct is identified (i.e. multiple parses for an input is identified), bison looks for what is termed as dynamic precedence to resolve the contention. This is similar to how shift/reduce or reduce/reduce conflicts are resolved using %prec. The deferred semantic action is invoked on the alternative with the highest precedence.

Alternately the user can specify a merge function (using %merge) to act on the resultant semantic value of both. This can prove useful if ambiguities are scarce. Given the possible number of ambiguities in ASN.1 we will resort to giving precedence to the simplest construct and later converting this to the correct form based on the context, much like how we would it manually.

Consider Sample Code # 1 for example. In this case of contention, OBJECT IDENTIFIER is given the highest dynamic precedence. Before semantic analysis is performed the RHS is converted to the type of TYPE on the LHS - in case #2 into SEQUENCE's and in case #3 SET's AST. In case of Sample Code # 2 with similar precedence the conversions are as follows - case #2/3 to SEQUENCE OF, case #4 to BIT STRING and in case #5 to an object of the class. The thumb-rule for associating dynamic precedence is - ASN.1 types whose AST is simple in representation and easier to convert from to the other representations gets higher precedence.

Interpreting Defined Syntax

Ambiguities are just one part of the problem of parsing an ASN.1 specification. With good familiarity of the language we quickly realize that the Defined-Syntax can be mapped uniquely to a Default-Syntax equivalent, which is more tenable for processing by subsequent phases of the compiler. Since the AST representation for Defined-Syntax is least rigid in structure we give it the least dynamic precedence. Thus arbitrating ambiguities introduced by the Defined-Syntax is just an extension of what we do for other ambiguities.

-- Class Definition
INVALIDCLASS ::= CLASS
{
  &val1 INTEGER OPTIONAL,
  &val2 REAL OPTIONAL
}
WITH SYNTAX
{ -- The Sentence defined by the user
   OBJECT [V &val1] [V &val2]
}
...

object INVALIDCLASS ::= { OBJECT V 0 }


Defined-Syntax involves describing the English like syntax for an object of the class. The restrictions (Ref: ITU-T X.681 Clause 10.12) on the new syntax clearly prohibits any ambiguity. The presence of Literals (WORD or comma) and PrimitiveFieldNames (for e.g. &val) is mandated in such a manner that at any point either a Setting (a Type / Value / Class / Object / Value-Set / Object-Set) corresponding to a definite FieldName or one amongst a set of distinct possible Literals is expected.

The code to the right exemplifies a violation of the restriction. It is not clear from the object definition as to if the >Setting '0' corresponds to the field &val1 or &val2.

Restriction of such characteristics enables utilizing a DFA (Deterministic Finite Automata) to validate and perform translation of the input without ambiguities. In the case of ASN.1 the restrictions translate to - for any state in the DFA outgoing labels can have utmost one Setting and any number of distinct Literals.

Consider the following class definition and the DFA corresponding to the same.

-- Class Definition
VALIDCLASS ::= CLASS
{
  &TYPE OPTIONAL,
  &val1 REAL OPTIONAL,
  &val2 &Type OPTIONAL,
  &ValSet INTEGER OPTIONAL,
}
WITH SYNTAX
{ -- The Sentence defined by the user
   [ [&Type] TOK-A [&val1] ] TOK-B [ [&val2] TOK-C [&ValSet] ]
}
.

.

two INTEGER ::= 2
object VALIDCLASS ::= { TOK-A 10 TOK-B two TOK-C { 10 | two } }


Look below for the DFA for the above class definition. The run for each object of the class begins at State-A and accepts the input at State-I.

Although in this case it was quite easy to formulate the DFA by hand, constructing one directly from the English like syntax is not so straightforward. Optional groups essentially represent ε-transitions which are not allowed in a DFA. But, constructing a NFA (Non-eterministic Finite Automata) from the English like syntax is quite trivial with one additional restriction to prohibit any ambiguity (that no two edges from a state will take the same input). Once there it is just a matter of using the Powerset construction method to build the DFA out of it. Note: Still any other restrictions such as Clause 10.12.a ITUT- X.681 which are not implied by a FA need to be verified before construction of the NFA.

An advantage of using the FA approach is that we can precisely point as to what was expected when we encounter an ill formed object definitions using the Defined-Syntax. Further a non-conforming object will still let us parse the remaining unscathed.

Most important of all, the Setting which is specified as part of the object definition could in-turn be an ambiguous construct which requires disambiguation. The FA method yields well to handling these in a recursive fashion. The GLR parsing and the above method combined lets us build an ASN.1 parser that can handle all of the complexities with no known deficiencies.

Conclusions and future directions

One of the downside of dealing with ambiguities this way is that the ambiguous constructs could be very long resulting in parser stack bloat. In such consideration it is better to alter the grammar to remove ambiguities, although it would be quite a task to prove its completeness & correctness.

On a less related note LR parsers are not known to have a error recovery mechanism as good as a LL parser. Even if it can parse it all, the error recovery could be much less effective when compared to that of LL's.

There is a widely accepted notion that LR parser does not suit well for parsing ASN.1. Having seen what we can do with a GLR parser we no longer have to resort to an LL grammar (not to mention the arduous transformation involved) owing to inadequacies of a LR parser.

References

Comments