
Lexer Rules


Here are the rules that you need to follow when implementing the lexer for your parser.

Tokens

Most parser generators rely on an external lexer to produce tokens. For example, yacc is often used with lex. In contrast, your parser generator will integrate a lexer. The benefit is a huge notational convenience. With yacc/lex, the lexer specification defines the tokens:

    %{ enum { PLUS, MINUS, DIV, MULT }; %}
    %%
    '+'   { return PLUS; }
    '-'   { return MINUS; }
    ...

and a separate syntactic specification then refers to these tokens:

   E : E PLUS E
     | E MINUS E
     ;

We made a different design decision.  Our grammar will be specified in a single file, which is also more readable:

   E -> E '+' E
     | E '-' E
     ;

Thus, our token definitions are inlined with the grammar.  These are the terminals of the grammar.  They can be specified as strings (e.g., '+') or as regexes (e.g., /[0-1]+/).
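One plausible way to turn these terminals into the (regular expression, token) pairs described below is sketched here in Python.  This is illustrative only; the helper name and the terminal representation are assumptions, not part of the spec:

    import re

    def make_token_patterns(terminals):
        # `terminals` lists the grammar's terminals in order of first
        # appearance: regex terminals keep their /.../ delimiters, and
        # string terminals arrive with their quotes already stripped
        # (an assumed representation, for illustration only).
        patterns = []
        for i, term in enumerate(terminals):
            if term.startswith('/') and term.endswith('/'):
                pattern = term[1:-1]        # regex terminal: strip the slashes
            else:
                pattern = re.escape(term)   # string terminal: match it literally
            patterns.append((re.compile(pattern), i))  # i serves as the token id
        return patterns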

Lexical Disambiguation

Recall from lecture that the lexer breaks up its input string into lexemes. It assigns each lexeme a corresponding token and returns a sequence of (token, lexeme) pairs. This is pretty straightforward, but what if there are multiple ways to partition the input string into lexemes? We clearly need an unambiguous definition.

We need to be a bit more precise here.  Formally, the lexer is given a lexical specification in the form of a set of (regular expression, token) pairs {(re1, t1), ..., (ren, tn)}. Your parser will extract the rei's from the grammar AST.  The ti values are invented by the parser to uniquely identify each token. Furthermore, the ti values are ordered: if rei precedes rej in the grammar file, then ti < tj. This ordering will be used for disambiguation; see below.

Each rei describes a class of lexemes, such as the set of legal identifiers. Given the input string s, the lexer finds the largest prefix p of s that matches the regular expression

     re1 | re2 | ... | ren

If the largest matching prefix, p, matches only one of the rei's, no ambiguity can arise.  Otherwise, the token is selected with a tie-breaking rule: if rei is the first regular expression that matches p (i.e., no expression rej, with j < i, matches the prefix), then the lexer returns the pair (ti, p). The process then repeats on the remainder of the input. When the end of file is reached, a special end-of-file token is returned.

Note that this lexer definition introduces two disambiguation rules:

  1. it requires the lexer to perform the maximal munch, and
  2. it breaks ties by prioritizing earlier regular expressions.

You should convince yourself that these rules eliminate all lexical ambiguity, in the sense that for any legal input string, there is only one sequence of token-lexeme pairs that the lexer is allowed to return.
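A minimal sketch of this lexing loop in Python, assuming the (compiled regex, token) pairs are already ordered by precedence (the function name and interface are illustrative, not prescribed):

    def tokenize(patterns, s):
        # `patterns` is an ordered list of (compiled regex, token) pairs;
        # earlier entries have higher precedence.
        pos = 0
        while pos < len(s):
            best_len, best_token = 0, None
            for regex, token in patterns:
                m = regex.match(s, pos)
                # Rule 1 (maximal munch): keep only strictly longer matches.
                # Rule 2 (priority): on a tie in length, the earlier pattern
                # wins -- the strict '>' comparison already guarantees this.
                if m is not None and m.end() - pos > best_len:
                    best_len, best_token = m.end() - pos, token
            if best_token is None:
                # No pattern matches any prefix: a lexical error
                # (see the extensions described below).
                raise SyntaxError('lexical error at position %d' % pos)
            yield (best_token, s[pos:pos + best_len])
            pos += best_len
        yield ('EOF', '')   # the special end-of-file token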

To make the lexer more suitable for practical tasks, we introduce a few extensions:

  1. The declaration %ignore re instructs the lexer to discard the lexeme matching the regular expression re. That is, the lexeme is not going to receive an edge in the initial parse graph.  It is interesting to note that the effect is not the same as if the lexeme were deleted from the input string before parsing, as that transformation would adjoin the text before and after the deleted lexeme, which could result in different lexing.

    For the purpose of breaking ties, this regular expression has higher precedence than any other regular expression. This declaration can be used to ignore whitespace and comments; a sketch of how %ignore fits into the lexer follows this list.

  2. When no regular expression matches any prefix of the remaining input, the lexer should notify the parser that a lexical error has occurred. In Milestone 1, your recognizers should return False if there is a lexical error. In further steps, your parser will instead print an error message.
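Continuing the sketch above, both extensions fit naturally: the %ignore patterns are placed first in the pattern list (giving them the highest tie-breaking precedence), their tokens are filtered out after matching, and a failed match signals the lexical error.  Again, the names are illustrative:

    def tokenize_with_ignore(patterns, ignore_tokens, s):
        # `ignore_tokens` is the set of token ids produced by %ignore
        # declarations; their lexemes are matched normally (same
        # disambiguation rules) but never handed to the parser.
        for token, lexeme in tokenize(patterns, s):
            if token not in ignore_tokens:
                yield (token, lexeme)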

Example

To illustrate the precedence rules, consider the following contrived grammar.  The grammar defines five (regular expression, token) pairs, with the precedences shown below it.

Example Application of Lexical Precedence Rules:

  %ignore      /[ \t]+/
  %ignore      /#.*/
  %%
  S   -> Num | 'if' | Id ;
  Num -> /[0-9]+/ ;
  Id  -> /[a-z]+/ ;
  1. [ \t]+   ignored
  2. #.*      ignored
  3. if
  4. [0-9]+
  5. [a-z]+
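Running the hypothetical tokenize_with_ignore sketch from above on this grammar shows both rules at work (the token names here are invented for readability; your parser will use its own token ids):

    import re

    patterns = [(re.compile(p), t) for p, t in [
        (r'[ \t]+', 'WS'),       # precedence 1, %ignore
        (r'#.*',    'COMMENT'),  # precedence 2, %ignore
        (r'if',     'IF'),       # precedence 3
        (r'[0-9]+', 'NUM'),      # precedence 4
        (r'[a-z]+', 'ID'),       # precedence 5
    ]]

    # 'if'  -> IF  by rule 2: both /if/ and /[a-z]+/ match all of 'if',
    #          and the tie goes to the earlier pattern.
    # 'ifs' -> ID  by rule 1: /[a-z]+/ matches 'ifs', longer than 'if'.
    list(tokenize_with_ignore(patterns, {'WS', 'COMMENT'}, 'if ifs 42'))
    # => [('IF', 'if'), ('ID', 'ifs'), ('NUM', '42'), ('EOF', '')]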

Equivalent Terminals

Terminals that produce equivalent regular expression objects are considered to be the same terminal. Two regular expression objects re1 and re2 are equivalent if, in Python, re1.pattern == re2.pattern is True.

For example, in the following grammar:

      %%
      A -> B '+' ;
      B -> '-' C ;
      C -> '+' | /\+/ ;

There are only two token patterns: '\+' and '\-'. The pattern '\+' has higher precedence than '\-'.
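This equivalence is easy to check directly in Python; here re.escape models how a string terminal could be turned into a pattern (an assumption about the implementation, not a requirement):

    import re

    lit_plus  = re.compile(re.escape('+'))   # from the string terminal '+'
    rex_plus  = re.compile(r'\+')            # from the regex terminal /\+/
    lit_minus = re.compile(re.escape('-'))   # from the string terminal '-'

    assert lit_plus.pattern == rex_plus.pattern    # same terminal: '\+'
    assert lit_plus.pattern != lit_minus.pattern   # distinct terminals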

You are not required to handle the regular expression symbol "$", as it adds needless complexity.  That is, assume that the regular expressions your parser sees in its input do not contain a $ character.

%ignore

%ignore is intended for stripping out parts of the input that are *difficult* to strip using grammar productions.  Two examples:

  - white space

  - comments

For example, if you wanted your expression grammar to ignore whitespace and comments without %ignore, you'd have to write a grammar like this:

    E -> whitespace E whitespace '+' whitespace E | ...
    whitespace -> /\s+|<comments>/

Not impossible, but ugly, because the whitespace would be everywhere.  You would be likely to miss adding it here and there, and that would cause endless debugging.

You should not use %ignore for delimiters, such as the ';' after statements, as these can appear in only one place in the grammar, unlike whitespace.  In fact, depending on the syntax of the language, you may not be able to %ignore the delimiters -- the parser needs them for disambiguation.  Imagine you remove the semicolon from between these two statements:

    x;(x)

you change two expressions

    E ; ( E )

into a function call

    E ( E )

FAQ (from fa09, when students were implementing the lexer):

Q: Do ignore operations always have higher priority than normal non-terminals, or is it only in the case of a tiebreaker? For instance, if you have the grammar

 %ignore /ab/
 %%
 S -> Id ;
 Id -> /[a-z]*/ ;

and you run the lexer on the input 'abcd', will it return an Id token with the lexeme 'abcd', or with the lexeme 'cd'?

A: An excellent question.  I reread our spec and it is indeed ambiguous in this respect.

Both alternatives would be acceptable because in practice this problem is not very likely to arise --- you rarely want to ignore a prefix of a legal lexeme, which is the only case where our spec is ambiguous.  Hence the semantic difference between the two alternatives is not likely to show up, and so you could in principle pick either of them.

But let's not pick just any of them.  When faced with a choice, it is better to select the option that adds the smallest number of exceptions (i.e., corner cases) and is thus clearer to the programmer.

The simplest semantics for %ignore is that it defines tokens like any other (with the same disambiguation rules), except that when they are recognized, they are thrown out.

This is what you should implement.  This is what our parser does.
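Concretely, under this semantics the FAQ input resolves by maximal munch before any ignoring happens: /[a-z]*/ matches all of 'abcd', which is longer than the two characters matched by /ab/, so the lexer returns a single Id token with lexeme 'abcd'.  Using the hypothetical sketch from earlier:

    import re

    patterns = [(re.compile(p), t) for p, t in [
        (r'ab',     'IGNORED'),  # the %ignore pattern, highest precedence
        (r'[a-z]*', 'ID'),
    ]]

    list(tokenize_with_ignore(patterns, {'IGNORED'}, 'abcd'))
    # => [('ID', 'abcd'), ('EOF', '')]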
