eAnother Tool for Language Recognition (v4)
Tool: from grammar, to parser/lexer in Java (code generation) and
Run time: needed to run
Tree layout library
StringTemplate: for generating code and other structured text
Adaptive LL(*) or ALL(*) - extension to performs grammar analysis dynamically at runtime rather than statically; no contort grammars to fit underlaying parsing strategy not avoid ambiguity warning
Eclipse IDE integration: https://github.com/jknack/antlr4ide
Other integration see site.
Instruction for command line: https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Getting+Started+with+ANTLR+v4
ANTLR_LIB=/home/bing/Software_n_Docs/Development/ANTLR/antlr-4.2-complete.jar
export CLASSPATH=$CLASSPATH:$ANTLR_LIB:.target/generated-sources
alias antlr4="java -jar $ANTLR_LIB"
alias grun='java org.antlr.v4.runtime.misc.TestRig'
After installation, test with the example so the two alias commands works
- antlr4
- grun com.java.path.GrammarName ruleName [-tree] [-gui]
antlr4 Hello.g4 # generate parser and lexer
grun GrammarName rulename -tokens
grun GrammarName rulename -tree
- create user library for antlr-4.2-complete.jar
- attach JavaDoc
- set ANTLR preference if desired
- add package into build path
- put a .g4 file in a source package
- open and edit with syntax assistance
- save and will automatically generate parser source (can be turned off)
- set source file folder for generated source if necessary
- can also right click / run grammar
- edit external tools configuration to generate visitor / other parameter etc.
Parser - program recognize languages
Grammar - specify language rules; ANTLR transform grammar to parser
Two phrases
- lexical anaylsis : tokenizing
- parsing: build a parse tree or syntax tree
- traditionally: embed application-specific code snippets in grammer
- parse trees: more decoupled and tidier, allows for multiple passes
ANTLR tool generates: recursive-descent parsers
- a collection of recursive methods, one per rule
- descent means parsing begins at root (start symbol) of a parse tree and proceeds towards leaves
- alternatives: uses SWITCH to make parsing decision or predication
- lookahead token: just the next token, sniffs before matching and consuming it
Example: rule
assign : ID '=' expr ';' ; // match an assignment statement like "sp = 100;"
Generated products:
GrammarNameParser - parser, contains a method for each rule in the grammar and some support code
GrammarNameLexer - lexer
GrammarNameTokens - assigns a token type number for each token and stores here
GrammarNameListener.java, GrammarNameBaseListener.java - listener interface and an empty, default implementation for programmer to override
- Exact repetition in alternatives - which?
- subtle ones - results two interpretations of same input
Rule ambiguity: ANTLR chooses the first alternative involved in decision
Lexer ambiguity: matching the first in grammar
Build language application
ANTLR classes:
- CharStream
- Lexer
- Token
- Parser
- ParserTree
- RuleNode
- ANTLR generates a RuleNode subclass for each rule
- are named RuleNameContext, etc....
- getChild() / getParent()
- TerminalNode
- TokenStream : connects Lexer and Parser
- Parse-tree listener
- receives start / end callbacks
- ParseTreeWalker - ANTLR generates a ParseTreeWalker subclass for each grammar with enter and exit methods for each rule
- with label - enter and exit per label
- depth-first (below an example)
- enter rule-level1
- enter rule-level2
- visitTerminal1...visitTerminal2...
- exit...
- Parse-tree visitor
- with option "-visitor" asks to generate a visitor interface from a grammar
- visitor interface: public interface RuleNameVisitor<T> {... visitLabelRuleEtc(RuleNameParser.NodeContext ctx);....}
- visit method:
- without labels on alternatives: with a visit method per rule
- with labels: one method per label
- a depth-first walk of the parse tree
- call visit() to move on, or visitXXX() to visit sub rule
- Difference between
- invoke
- listener methods are called by walker object (passive, listen)
- visitor methods must walk their children with explicit visit calls (active)
- which means visit is controlled
- which means no invoke visit() means sub tree not visited
- number of methods
- listener methods count can be HUGE where rules are many (enter / exit)
- visitor fewer (only visit)
- method definition
- visitor
- return generic type T
- like: public T visitXXX( XXXParser.XXXContext ctx)
- listener
- no return type
- like: void enter/exitXXX (XXXParser.XXXContext ctx)
- base implementation
- listener: nothing, empty method body (cause it's passive)
- visitor: return visitChildren(ctx) - actively visit everything
- enter / exit Rule
- visitTerminal()
- enter / exit EveryRule
- visitErrorNode()
- method for each of the elements mentioned in rule
- ctx.TOKENNAME() gets a TerminalNode, then
- .getText() - get the text for the token, or
- getSymbol() - get the token
Integrate generated parser
// import ANTLR's runtime libraries
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;
public class Test {
public static void main(String[] args) throws Exception {
// create a CharStream that reads from standard input
ANTLRInputStream input = new ANTLRInputStream(System.in);
// create a lexer that feeds off of input CharStream
[GrammarName]Lexer lexer = new [GrammarName]Lexer(input);
// create a buffer of tokens pulled from the lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);
// create a parser that feeds off the tokens buffer
[GrammarName]Parser parser = new [GrammarName]Parser(tokens);
[RuleName]Context tree = parser.[ruleName](); // begin parsing at init rule
System.out.println(tree.toStringTree(parser)); // print LISP-style tree
}
}
Following above example, here comes the ParseTreeWalker to talk the tree with a Listener implementation and the tree
// Create a generic parse tree walker that can trigger callbacks
ParseTreeWalker walker = new ParseTreeWalker();
// Walk the tree created during the parse, trigger callbacks
walker.walk(new ShortToUnicodeString(), tree);
Using a visitor
EvalVisitor eval = new EvalVisitor();
eval.visit(tree);
- comments: java style, // or /* */
- ID
- for token names, lexer rules: Start_upper_then_any
- for parser rule names: startLowerThenAny
- Literals
- literal strings: 'in single quotes', '\'' (escape) '\u00E8' (unicode) '\n' '\r' '\t' '\b'(backspace) '\f'
- assumes UTF-8 format
- Actions: code blocks in target language
- { arbitrary text surrounded by curly braces }
- { closing curly, if in "}" or /*}*/, no need to escape,
- if in pair, no need to escape
- or to escape extra curly with \, such as \{, \}}
- keywords:
import
, fragment
, lexer
, parser
, grammar
, returns
, locals
,throws
, catch
, finally
,mode
, options
, tokens
EBNF
- grammar Name // declaration
- file must be called Name.g4
- without prefix (like above) are combined grammars
- parser grammar Name // only parser rules
- lexer grammar Name // only lexer rules
- below elements can be in any order
- options {....}
- import .......;
- inherits all rules, tokens specifications and named actions
- override imported grammars
- options ignored
- see more on document
- tokens {.....}
tokens { Token1, .... TokenN }
- implicit declaration token in rule - a : X;
- @actionName {.....} // required in ANTLE3 but for ANTLR4, use visitor / listener
- @header { package foo;... } // into java header
- @members {int count=0; } // into java as member field / methods
- @after{}
- rule1... // parser and lexer rules, possibly intermingled
ruleName: alternative1 | ..... | alternativeN;
- parser rules:
- name start lower
- /** Java doc comment can precede rule */
restart : 'return' expr ';' ; // an example
rule1 : 'extends' ID | // empty means entire rule is optional
- sub rule : group with ()
- Rule context objects
- ANTLR will generate methods to access the rule context objects associated with each rule
- example:
- ruleName : e '++';
- generates:
- public static class IncContext extends ParserRuleContext{
- public EContext e() {....}
- field : e '.' e
- generates:
- public static class FieldContext .....
- public EContext e(int i) {....} // get ith e context
- public List<EContext> e() {....} // get ALL e context
- alternative labels:
- label an alternative with # operator following a Name;
- example:
| e '+' e #BinaryOp
- all alternatives within a rule must be labelled, or none
- can be labelled the same label, but cannot conflict rule name
- rule context class definition will be generated for each label (enter and exit methods)
- recommend using to simplify coding
- rule element labels
stat: 'return' value=e ';' #Return | 'break' ';' Break;
- generates:
- public static class ReturnContext extends StatContext{
- public EContext value;....
- lexer rules:
- name start capital
- INT : [0-9]+ ; // like, but not regular expression, see reference
- WS : [ \t\r\n]+ -> skip ; // define whitespace rule, and toss it out
Lexer Rules (not complete)