Here are the rules that you need to follow when implementing the lexer for your parser.
Tokens
Most parser generators rely on an external lexer to produce tokens. For example, yacc is often used with lex.
In contrast, your parser generator will integrate a lexer. The benefit
is a huge notational convenience. With yacc/lex, the lexer specification
defines the tokens:
%{
enum { PLUS, MINUS, DIV, MULT };
%}
%%
"+" { return PLUS; }
"-" { return MINUS; }
...
and a separate syntactic specification then refers to these tokens:
E : E PLUS E
| E MINUS E
;
We made a different design decision: our grammar specification is a single, more readable file.
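For instance, a combined specification for the toy expression language above can look like this (the exact rule is illustrative; the notation is introduced below):

%%
E -> E '+' E
   | E '-' E
   | /[0-9]+/
   ;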
Thus, our token definitions are inlined with the grammar. These are the terminals of the grammar. They can be specified as strings (e.g., '+') or as regexes (e.g., /[0-1]+/).
Lexical Disambiguation
Recall from lecture that the lexer breaks up its input string into lexemes. It assigns each lexeme a corresponding token and returns a sequence of (token, lexeme) pairs. This is pretty straightforward, but what if there are multiple ways to partition the input string into lexemes? We clearly need an unambiguous definition.
We need to be a bit more precise here. Formally, the lexer is given a lexical specification in the form of a set of regular expression-token pairs {(re1, t1), ..., (ren, tn)}. Your parser will extract the rei from the grammar AST. The ti values are invented by the parser to uniquely identify each token. Furthermore, the ti values are ordered: if rei precedes rej in the grammar file, then ti < tj. This ordering will be used for disambiguation, as described below.
Each rei describes a class of lexemes, such as the set of legal identifiers. Given the input string s, the lexer finds the largest prefix p of s that matches the regular expression re1 | re2 | ... | ren.
If the largest matching prefix p matches only one of the rei, no ambiguity can arise. Otherwise, the token is selected with a tie-breaking rule: if rei is the first regular expression that matches p (i.e., no expression rej with j < i matches the prefix), then the lexer returns the pair (ti, p). The process then repeats on the remainder of the input. When the end of the input is reached, a special end-of-file token is returned.
Note that this lexer definition introduces two disambiguation rules:
- it requires the lexer to perform the maximal munch, and
- it breaks ties by prioritizing earlier regular expressions.
You should convince yourself that these rules eliminate all lexical
ambiguity, in the sense that for any legal input string, there is only
one sequence of token-lexeme pairs that the lexer is allowed to return.
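To make this concrete, here is a minimal Python sketch of such a lexer. It assumes the specification has already been extracted as a list of (regex, token) pairs in grammar order; the function name tokenize and the "EOF" token name are placeholders, not part of the spec:

import re

def tokenize(spec, s):
    # spec: list of (regex_string, token) pairs in grammar order,
    # so earlier entries win ties (higher precedence)
    compiled = [(re.compile(r), t) for r, t in spec]
    pos, out = 0, []
    while pos < len(s):
        best_len, best_tok = 0, None
        for pattern, tok in compiled:
            m = pattern.match(s, pos)
            # maximal munch: only a strictly longer match replaces the
            # current best, so ties keep the earlier pattern
            if m is not None and m.end() - pos > best_len:
                best_len, best_tok = m.end() - pos, tok
        if best_tok is None:
            raise SyntaxError("lexical error at position %d" % pos)
        out.append((best_tok, s[pos:pos + best_len]))
        pos += best_len
    out.append(("EOF", ""))  # the special end-of-file token
    return out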
To make the lexer more suitable for practical tasks, we introduce a few extensions:
- The declaration
%ignore re
instructs the lexer to discard any lexeme matching the regular expression re. That is, the lexeme will not receive an edge in the initial parse graph. Note that the effect is not the same as if the lexeme were deleted from the input string before parsing: that transformation would adjoin the text before and after the deleted lexeme, which could result in different lexing.
For the purpose of breaking ties, this regular expression has higher precedence than any other regular expression. This declaration can be used to ignore whitespace and comments. (One possible implementation is sketched after this list; a subtle corner case is discussed in the FAQ at the end of this section.)
- When no regular expression matches any prefix of the remaining input, the lexer should notify the parser that a lexical error has occurred. In Milestone 1, your recognizers should return False if there is a lexical error. In later milestones, your parser will instead print an error message.
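Building on the tokenize sketch above, one way to model %ignore is to treat ignored patterns as ordinary tokens with the highest precedence and drop their lexemes from the output. The helper name tokenize_with_ignore and the _IGNORE token names are invented for this sketch:

def tokenize_with_ignore(ignore_res, spec, s):
    # listing the ignore patterns first gives them the highest
    # precedence under the tie-breaking rule above
    full_spec = [(r, "_IGNORE%d" % i) for i, r in enumerate(ignore_res)] + spec
    # ignored lexemes are matched like any others, then thrown out
    return [(t, lx) for t, lx in tokenize(full_spec, s)
            if not t.startswith("_IGNORE")]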
Example
To illustrate the precedence rules, consider the following contrived grammar. The grammar defines five (regular expression, token) pairs with the precedences shown in the table below.
Example Application of Lexical Precedence Rules:
%ignore /[ \t]+/
%ignore /#.*/
%%
S -> Num | 'if' | Id ;
Num -> /[0-9]+/ ;
Id -> /[a-z]+/ ;
Precedence | Regular expression | Note
1          | [ \t]+             | ignored
2          | #.*                | ignored
4          | if                 |
5          | [0-9]+             |
6          | [a-z]+             |
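Using the hypothetical tokenize_with_ignore helper from above, this specification behaves as follows (the token names IF, NUM, and ID are made up for illustration):

ignore = [r"[ \t]+", r"#.*"]
spec = [(r"if", "IF"), (r"[0-9]+", "NUM"), (r"[a-z]+", "ID")]

# 'if' matches both /if/ and /[a-z]+/ with length 2; the tie goes
# to /if/, which appears earlier in the grammar (precedence 4 vs. 6)
print(tokenize_with_ignore(ignore, spec, "if 42 # comment"))
# [('IF', 'if'), ('NUM', '42'), ('EOF', '')]

# maximal munch: 'iffy' lexes as one Id, not as 'if' followed by 'fy'
print(tokenize_with_ignore(ignore, spec, "iffy"))
# [('ID', 'iffy'), ('EOF', '')]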
Equivalent Terminals
Terminals that produce equivalent regular expression objects are considered to be the same terminal. Two regular expression objects re1 and re2 are equivalent if, in Python, re1.pattern == re2.pattern is True.
For example, in the following grammar:
%%
A -> B '+' ;
B -> '-' C ;
C -> '+' | /\+/ ;
There are only two token patterns: '\+' and '\-'. The pattern '\+' has higher precedence than '\-'.
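Assuming string terminals are converted to regexes with Python's re.escape (a plausible implementation choice, not mandated here), the equivalence of '+' and /\+/ can be checked directly:

import re

re1 = re.compile(re.escape('+'))   # from the string terminal '+'
re2 = re.compile(r'\+')            # from the regex terminal /\+/
print(re1.pattern == re2.pattern)  # True: both patterns are '\+'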
You are not required to handle the regular expression symbol "$", as it adds needless complexity. That is, assume that the regular expressions that your parser will see on the input do not contain a $ character.
%ignore
%ignore is intended for stripping out parts of the input that are *difficult* to strip using grammar productions. Two examples:
- whitespace
- comments
For example, if you wanted your expression grammar to handle whitespace and comments without %ignore, you would have to write a grammar like this:
E --> whitespace E whitespace '+' whitespace E | ...
whitespace --> /\s+|<comments>/
Not impossible, but ugly, because the whitespace would be everywhere. You would likely forget to add it here and there, and that would cause endless debugging.
You should not use %ignore for delimiters, such as the ';' after statements, as these can appear in only one place in the grammar, unlike whitespace. In fact, depending on the syntax of the language, you may not be able to %ignore the delimiters at all -- the parser needs them for disambiguation. Imagine removing the semicolon from between these two statements:
x;(x)
Doing so changes two expressions
E ; ( E )
into a function call
E ( E )
FAQ (from fa09, when students were implementing the lexer):
Q: Do ignore operations always have higher priority than normal tokens, or do they win only as a tiebreaker? For instance, if you have the grammar
%ignore /ab/
%%
S -> Id ;
Id -> /[a-z]*/ ;
and you run the lexer on the input 'abcd', will it return an Id token with the lexeme 'abcd', or with the lexeme 'cd'?
A: An excellent question. I reread our spec and it is indeed ambiguous in this respect. Both alternatives would be acceptable, because in practice this problem is not very likely to arise --- you rarely want to ignore a prefix of a legal lexeme, which is the only case where our spec is ambiguous. Hence the semantic difference between the two alternatives is not likely to show up, and so you could in principle pick either of them.
But let's not pick just any of them. When faced with a choice, it is better to select the option that adds the smallest number of exceptions (i.e., corner cases) and is thus clearer to the programmer.
The simplest semantics for %ignore is that it defines tokens like any other (with the same disambiguation rules), except that when they are recognized, they are thrown out. Under this semantics, the input 'abcd' above lexes as a single Id with lexeme 'abcd', because maximal munch prefers the longer match. This is what you should implement, and it is what our parser does.
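Under these semantics, the hypothetical sketch from earlier gives:

# /ab/ matches a prefix of length 2, but /[a-z]*/ matches all four
# characters; maximal munch picks the longer match, so nothing is ignored
print(tokenize_with_ignore([r"ab"], [(r"[a-z]*", "ID")], "abcd"))
# [('ID', 'abcd'), ('EOF', '')]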