ANTLR Parser

eAnother Tool for Language Recognition (v4)

Contains

Tool: from grammar, to parser/lexer in Java (code generation) and

Run time: needed to run

Tree layout library

StringTemplate: for generating code and other structured text

Adaptive LL(*) or ALL(*) - extension to performs grammar analysis dynamically at runtime rather than statically; no contort grammars to fit underlaying parsing strategy not avoid ambiguity warning

Installation

Eclipse IDE integration: https://github.com/jknack/antlr4ide

Other integration see site.

Instruction for command line: https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Getting+Started+with+ANTLR+v4

ANTLR_LIB=/home/bing/Software_n_Docs/Development/ANTLR/antlr-4.2-complete.jar

export CLASSPATH=$CLASSPATH:$ANTLR_LIB:.target/generated-sources

alias antlr4="java -jar $ANTLR_LIB"

alias grun='java org.antlr.v4.runtime.misc.TestRig'

After installation, test with the example so the two alias commands works

antlr4
grun com.java.path.GrammarName ruleName [-tree] [-gui]

Test

antlr4 Hello.g4 # generate parser and lexer

grun GrammarName rulename -tokens

grun GrammarName rulename -tree

Using Eclipse IDE

create user library for antlr-4.2-complete.jar
attach JavaDoc
set ANTLR preference if desired
add package into build path
put a .g4 file in a source package
open and edit with syntax assistance
save and will automatically generate parser source (can be turned off)
set source file folder for generated source if necessary
can also right click / run grammar
edit external tools configuration to generate visitor / other parameter etc.

Concepts

Parser - program recognize languages

Grammar - specify language rules; ANTLR transform grammar to parser

Two phrases

lexical anaylsis : tokenizing
parsing: build a parse tree or syntax tree
- traditionally: embed application-specific code snippets in grammer
- parse trees: more decoupled and tidier, allows for multiple passes

Generated Parser

ANTLR tool generates: recursive-descent parsers

a collection of recursive methods, one per rule
descent means parsing begins at root (start symbol) of a parse tree and proceeds towards leaves
alternatives: uses SWITCH to make parsing decision or predication
lookahead token: just the next token, sniffs before matching and consuming it

Example: rule

assign : ID '=' expr ';' ; // match an assignment statement like "sp = 100;"

Generated products:

GrammarNameParser - parser, contains a method for each rule in the grammar and some support code

GrammarNameLexer - lexer

GrammarNameTokens - assigns a token type number for each token and stores here

GrammarNameListener.java, GrammarNameBaseListener.java - listener interface and an empty, default implementation for programmer to override

Ambiguity

Exact repetition in alternatives - which?
subtle ones - results two interpretations of same input

Rule ambiguity: ANTLR chooses the first alternative involved in decision

Lexer ambiguity: matching the first in grammar

Build language application

ANTLR classes:

CharStream
Lexer
Token
Parser
ParserTree
- RuleNode
  - ANTLR generates a RuleNode subclass for each rule
  - are named RuleNameContext, etc....
  - getChild() / getParent()
- TerminalNode
TokenStream : connects Lexer and Parser

Tree walking mechanisms

Parse-tree listener
- receives start / end callbacks
- ParseTreeWalker - ANTLR generates a ParseTreeWalker subclass for each grammar with enter and exit methods for each rule
- with label - enter and exit per label
- depth-first (below an example)
  - enter rule-level1
  - enter rule-level2
  - visitTerminal1...visitTerminal2...
  - exit...
Parse-tree visitor
- with option "-visitor" asks to generate a visitor interface from a grammar
- visitor interface: public interface RuleNameVisitor<T> {... visitLabelRuleEtc(RuleNameParser.NodeContext ctx);....}
- visit method:
  - without labels on alternatives: with a visit method per rule
  - with labels: one method per label
- a depth-first walk of the parse tree
- call visit() to move on, or visitXXX() to visit sub rule
Difference between
- invoke
  - listener methods are called by walker object (passive, listen)
  - visitor methods must walk their children with explicit visit calls (active)
    - which means visit is controlled
    - which means no invoke visit() means sub tree not visited
- number of methods
  - listener methods count can be HUGE where rules are many (enter / exit)
  - visitor fewer (only visit)
- method definition
  - visitor
    - return generic type T
    - like: public T visitXXX( XXXParser.XXXContext ctx)
  - listener
    - no return type
    - like: void enter/exitXXX (XXXParser.XXXContext ctx)
- base implementation
  - listener: nothing, empty method body (cause it's passive)
  - visitor: return visitChildren(ctx) - actively visit everything

Visitor Interface

enter / exit Rule
visitTerminal()
enter / exit EveryRule
visitErrorNode()

Rule Context Objects

method for each of the elements mentioned in rule
ctx.TOKENNAME() gets a TerminalNode, then
- .getText() - get the text for the token, or
- getSymbol() - get the token

Integrate generated parser

// import ANTLR's runtime libraries

import org.antlr.v4.runtime.*;

import org.antlr.v4.runtime.tree.*;

public class Test {

    public static void main(String[] args) throws Exception {

        // create a CharStream that reads from standard input

        ANTLRInputStream input = new ANTLRInputStream(System.in);

        // create a lexer that feeds off of input CharStream

        [GrammarName]Lexer lexer = new [GrammarName]Lexer(input);

        // create a buffer of tokens pulled from the lexer

        CommonTokenStream tokens = new CommonTokenStream(lexer);

        // create a parser that feeds off the tokens buffer

        [GrammarName]Parser parser = new [GrammarName]Parser(tokens);

        [RuleName]Context tree = parser.[ruleName](); // begin parsing at init rule

        System.out.println(tree.toStringTree(parser)); // print LISP-style tree

Following above example, here comes the ParseTreeWalker to talk the tree with a Listener implementation and the tree

// Create a generic parse tree walker that can trigger callbacks

ParseTreeWalker walker = new ParseTreeWalker();

// Walk the tree created during the parse, trigger callbacks

walker.walk(new ShortToUnicodeString(), tree);

Using a visitor

EvalVisitor eval = new EvalVisitor();

eval.visit(tree);

Grammar Lexicon

comments: java style, // or /* */
ID
- for token names, lexer rules: Start_upper_then_any
- for parser rule names: startLowerThenAny
Literals
- literal strings: 'in single quotes', '\'' (escape) '\u00E8' (unicode) '\n' '\r' '\t' '\b'(backspace) '\f'
- assumes UTF-8 format
Actions: code blocks in target language
- { arbitrary text surrounded by curly braces }
- { closing curly, if in "}" or /*}*/, no need to escape,
- if in pair, no need to escape
- or to escape extra curly with \, such as \{, \}}
keywords:
- import, fragment, lexer, parser,
- grammar, returns, locals,
- throws, catch, finally,
- mode, options, tokens

Grammar Structure

EBNF

grammar Name // declaration
- file must be called Name.g4
- without prefix (like above) are combined grammars
- parser grammar Name // only parser rules
- lexer grammar Name // only lexer rules
- below elements can be in any order
options {....}
import .......;
- inherits all rules, tokens specifications and named actions
- override imported grammars
- options ignored
- see more on document
tokens {.....}
- tokens { Token1, .... TokenN }
- implicit declaration token in rule - a : X;
@actionName {.....} // required in ANTLE3 but for ANTLR4, use visitor / listener
- @header { package foo;... } // into java header
- @members {int count=0; } // into java as member field / methods
- @after{}
rule1... // parser and lexer rules, possibly intermingled
- ruleName: alternative1 | ..... | alternativeN;
  - parser rules:
  - name start lower
  - /** Java doc comment can precede rule */
  - restart : 'return' expr ';' ; // an example
  - rule1 : 'extends' ID | // empty means entire rule is optional
  - sub rule : group with ()
  - Rule context objects
    - ANTLR will generate methods to access the rule context objects associated with each rule
    - example:
      - ruleName : e '++';
      - generates:
        public static class IncContext extends ParserRuleContext{
        public EContext e() {....}
      - field : e '.' e
      - generates:
        public static class FieldContext .....
        public EContext e(int i) {....} // get ith e context
        public List<EContext> e() {....} // get ALL e context
  - alternative labels:
    - label an alternative with # operator following a Name;
    - example: | e '+' e #BinaryOp
    - all alternatives within a rule must be labelled, or none
    - can be labelled the same label, but cannot conflict rule name
    - rule context class definition will be generated for each label (enter and exit methods)
    - recommend using to simplify coding
  - rule element labels
    - stat: 'return' value=e ';' #Return | 'break' ';' Break;
    - generates:
      - public static class ReturnContext extends StatContext{
      - public EContext value;....
- lexer rules:
  - name start capital
  - INT : [0-9]+ ; // like, but not regular expression, see reference
  - WS : [ \t\r\n]+ -> skip ; // define whitespace rule, and toss it out

Lexer Rules (not complete)

Google Sites

Report abuse