SlopeJ5 - Parser Construction for Java 5 and above - A quick tutorial

SlopeJ5 - Parser Construction for Java 5 and above

- A quick tutorial

1. Introduction

Despite the fact that several compiler compilers are available, it is not that trivial to define a new language using such tools. The reason for it is, that such tools require an additional tool chain that needs to be integrated into the development environment and some understanding how such tool chains work.

Slope's goal of is to define formal languages - i.e. parsers - in a relatively simple way: directly within your programming language. Thereto, Slope defines a set of classes and a set of idioms how these classes should be used.

Note: Slope's goal is to be easy to use, not to be fast. I.e. there is no way that a parser definition using Slope is able to compete with a parser definition based on a tool such as ANTLR.

2. Getting Slope

You can download SlopeJ5 from bitbucket: https://bitbucket.org/shanenberg/slope

3. Parser Construction using Slope

In order to define a parser, two things need to be done: first, you define the scanner, second you define the parser.

3.1 Defining a Scanner

The first step in a parser definition is the scanner definition. The scanner's task is

  1. to divide the characters from the incoming stream into a number of tokens, and
  2. to define separators, i.e. tokens that are used to separate other tokens.

In slope, a scanner is something that uses a class that extends the class slowscane.Tokens. Typically, you start defining a scanner by defining the separators and then the other tokens.

    public class TestLangTokens extends slowscane.Tokens {
        // here appear additional lines for separators, keywords and other tokens
    }

3.1.1 Separators

Separators are tokens that are used in the token stream to separate other tokens. Examples for such tokens that are often found in parser definitions are line breaks or whitespaces. You define separators as methods that have the annotation @Separator and that return an object of type slowscane.scanners.TokenReader. The class TokenReader is a generic class with a type parameter that says something about the kind of data being provided by a token.

Slope defines a number of predefined token readers in the class slowscane.Tokens that can be access within a scanner class using the field lib.

If you want to define white spaces and new lines as separators, you define within the tokens class TestLangTokens two methods:

@Separator public TokenReader<?> WHITESPACE() { return lib.WHITESPACE; }

@Separator public TokenReader<?> NEWLINE() { return lib.NEWLINE;}

The first line defines a white space as a separator, the second line defines the new line as a separator. It is necessary to use the annotation in the code, i.e. without the annotation both lines will be ignored while scanning.

3.1.2 Keywords and Tokens

In the next step, the other tokens need to be defined. Slope defines between two different kinds of tokens: keywords and tokens. A keyword is a token that consists of a fixed string (or a single fixed character). The reason for making this distinction is, that it is possible that two token readers could accept the same (or a similar) set of characters.

For example, if a keyword such as "Int" should be used as a token while a Java identifier should be used as a token as well, there is the potential problem that it is unclear for the scanner whether a token sequence such as "Int something" consists of the keyword "Int" followed by a java identifier or whether it consists of two java identifiers. By telling slope that "Int" is a keyword, the int token reader has a higher priority than the java identifier. Consequently, Slope would then divide the sequence "Int something" into two tokens: the first one is the keyword "Int", the second one a java identifier (consisting of the characters "something").

Keyword are annotated with the annotation @Keyword, ordinary tokens are annotated with the annotation @Token.

@Keyword public TokenReader<String> Int() { return lib.KEYWORD_Int; }

@Token public TokenReader<String> JAVA_IDENTIFIER() { return lib.JAVA_IDENTIFIER; }

@Token public TokenReader<?> OBRACKET() { return lib.OPEN_BRACKET; }

@Token public TokenReader<?> CBRACKET() { return lib.CLOSED_BRACKET; }

3.1.2 Using (and Testing) the Scanner

Slope's convention for testing is, that the test class is an outer class for the scanner (and the parser) - however, it is not necessary to follow this guideline if you consider it annoying. Following Slope's convention, you define the test class as an outer class and the scanner as a static inner class (make it static in order to make it extensible).

You can use the scanner by creating an instance of slowscane.Scanner. The constructor expects a Tokens object (such as the one that has just been created) and a position stream. It is quite likely that you will test your scanner first with a simple String object. In that case, you create an instance of slowscane.streams.StringBufferStream by passing a String object to it.

    public TokenStream<?> scan(String inputString) {
        Tokens tokens = new TestLangTokens();
        StringBufferStream stringBufferStream = new StringBufferStream(inputString);
        Scanner<?, StringBufferStream> scanner = 
            new Scanner<Tokens, StringBufferStream>(tokens, stringBufferStream);
        return scanner.scan();
    }

Now it is possible to read the tokens from the tokenstream and define some tests on them.

public void testScan() {

TokenStream<?> ts = scan("( \n ( (Int ) )");

assertEquals("(", ts.get(0).getValue().toString());

assertEquals("(", ts.get(1).getValue().toString());

assertEquals("(", ts.get(2).getValue().toString());

assertEquals("Int", ts.get(3).getValue().toString());

}

The previous tests check, whether the tokens are as expected: three opening brackets followed by the keyword "Int" (two more brackets appear, but this is not tested in the code).

Ok, that's it for the scanner definition. Now we need to define a parser.


3.2 Defining a Parser

In slope a parser is a inner class of the previously defined tokens definition class. The parser's task is to describe rules about the ordering of the tokens in the tokenstream as returned by the scanner.

Traditionally, languages defined by parsers are described using context-free grammars. For example, the language that is to be defined in this small tutorial is defined by the grammar

<Expression> -> "Int" | JavaIdentifier | <BracketExpression>+

    <BracketExpression> -> "(" <Expression> ")" 

The start symbol of this grammar is <Expression> (Note: What exactly a JavaIdentifier is, is not defined in the grammar. Also, the grammar does not say a word about possible separators). The grammar defines a language that consists either of the keyword "Int", a Java identifier or a bracket expressions that is allowed mulitple times. A bracket expression consists of an opening bracket, an expression and a closing bracket.

3.2.1 Defining the Parser Class

The parser class is a subclass of slope.Parser<TOKENS extends Tokens>. The class consists of a number of methods that describe the rules of the grammar to be implemented by the parser. Each method has the following characteristics:

  1. The return type of a rule method is slope.Rule,
  2. The body of the method creates and returns an anonymous inner class (which is a subtype of slope.Rule).
  3. The body makes use of predefined methods from the parser (each of them return rule objects themselves)
    1. TOKEN: The method takes a token method as parameter which must match the following token from the stream.
    2. AND: The method AND combines all rules passed as parameters via concatenation (i.e. all passed rules must match one after another)
    3. OR: The method OR combines each passed rule via a union (i.e. at least one of the passed rules much match)
    4. NFOLD: The passed rule (one rule!) matches at least one time, but can also match multiple times.
    5. set: This method is required to define parse trees and will be explained later.

The class declaration of the parser can already be done as follows:

// This is the outer testing class in the example
public class TestLangParsing extends TestCase {
    // This is the tokens class which is the outer class for the parser
    public static class TestLangTokens extends slowscane.Tokens {
        // This is the parser class
        public class TestLangParser extends Parser<TestLangTokens> {
            // Here appear the rule methods
        }
    }
}

3.2.2 Using TOKEN()

In case we want to define a rule that says that the token as previously defined by the method Int() should be accepted (and if we want to call this method KEYWORD_INT()), we write the following method in the parser class:

Rule KEYWORD_Int() { return new Rule() { public PC pc() { return

TOKEN(Int());
}};}

The first line consists of a number of elements that just can be copied from different rule definitions: the return type RULE and the first part of the body, which is "return new RULE() { public PC pc() { return". The same is true for the last line, which just closes the definition of the anonymous inner class and the method definition.

The only interesting part of the body is the use of TOKEN() with parameter Int(). The rule says that the next token from the stream must be something as defined by the token method Int(). The rule method has the name KEYWORD_Int() and can be used by other rule methods in the parser class.

We create a similar rule method for the identifier:

Rule IDENTIFIER() { return new Rule() { public PC pc() { return

TOKEN(JAVA_IDENTIFIER());
}};}

3.2.3 Using OR()

In order to define a rule <Expression>, we define a method in the same way as before: it requires a return type RULE and the first part of the body is "return new RULE() { public PC pc() { return", and the last line of the method is "}};}". The rule says, that a simple expression is either the keyword "Int", an identifier or a N_TIMES_BRACKET_EXPRESSION. What a bracket expression is, isn't defined yet, but the other two parts (KEYWORD_Int and IDENTIFIER) are already known:

Rule EXPRESSION() { return new Rule() { public PC pc() { return

OR(KEYWORD_Int(), IDENTIFIER(), N_TIMES_BRACKET_EXPRESSION());

}};}

3.2.4 Using AND()

The rule <BracketExpression> concatenates an opening bracket (which is a token), an expression, and a closing bracket (which is a token).

Rule BRACKET_EXPRESSION() { return new Rule() { public PC pc() { return

AND(TOKEN(OBRACKET()), EXPRESSION(), TOKEN(CBRACKET()));

}};}

3.2.5 Using NFOLD()

We still haven't defined the rule N_TIMES_BRACKET_EXPRESSION(). For defining this rule, we use the method NFOLD. The method NFOLD takes a single rule as a parameter (which might be an AND or an OR rule) and says that the passed rule might appear multiple times (but at least once).

Rule N_TIMES_BRACKET_EXPRESSION() { return new Rule() { public PC pc() { return

NFOLD(BRACKET_EXPRESSION());
}};}

In our example, the passed rule is BRACKET_EXPRESSION(). Hence, this rule might appear multiple times.

3.2.6 Using and Testing the Parser

Again, for testing the parser, slope's guideline is to define an outer class (which already should have been defined before) and define the inner Tokens class as a static class. And again, we assume to test the parser first with a string.

Parsing is achieved by a slope.Parsing object that is instantiated with a tokens object and the string to be parsed (alternatively, a stream can be passed). Then, the method parse() needs to be invoked with the rule that should be used as the start symbol. First, we define in our test class a method check that receives the string to be tested in the following way:

public TreeNode check(String s) {

TestLangTokens tokens = new TestLangTokens();

Parsing p = new Parsing(tokens, s);

return p.parse(tokens.new TestLangParser().EXPRESSION());

}

Note that the start rule is passed by instantiating the parser object first and then invoking the rule on it. Now, it is possible to use the method check to test whether a certain word is part of the language defined by the parser.

public void testParse() {
  check("(   (Int) )\n   (abc)   ( ((def)))");
} 

The resulting code so far can be downloaded at the end of this document (file TestLangParsing.java).

In fact, check() already returns a parse tree. However, this parse tree is not really useful for the developer. Instead, it is more likely that developers will define their own, customized parse tree which will be explained next. Before defining the parse tree, we copy the file TestLangParsing.java and rename the containing class (and the file) to TestLangParsingWithParseTree.

3.3 Defining a Parse Tree

The convention in Slope is that if a customized parse tree is required (which is probably always the case), that the rule method contains one parameter which is a SetCmd object.

3.3.1 Defining Custom Parse Tree Classes

Before defining how the the parse tree is constructed, we need to define the data structure for the parse tree.

abstract class Expression {}

class IntExpression extends Expression {}

class IdentifierExpression extends Expression { public String identifier; }

class BracketExpression extends Expression { public Expression childExpression; }

class NFOLDBracketExpression extends Expression { public ArrayList<BracketExpression> bracketExpressions = new ArrayList<BracketExpression>();}

The parse tree should return an expression, which is either an IntExpression, an IdentifierExpression, a BracketExpression, or an NFOLDBracketExpression. The class Expression is abstract, because no object should be directly instantiated from it. The IntExpression represent the keyword "Int", i.e. no additional data needs to be stored. The IdentifierExpression should store the identifier. The BracketExpression needs to store the child expression.

3.3.2 Using SetCmd , Cmd, and setResult()

If a rule object needs to return a parse tree (respectively a sub-tree), three things are required:

  1. the rule method needs a final SetCmd object as parameter (which we call the result setter object),
  2. a result object needs to be created,
  3. the result object needs to be passed to the result setter

We start with the rule method KEYWORD_Int:.

Rule KEYWORD_Int(final SetCmd<Expression> resultSetter) { return new Rule() { public PC pc() {

// Defines the object which should be the result

final IntExpression expr = new IntExpression();

// Defines the command object that sets the result

final Cmd setResult = new Cmd() {public void doCmd() { resultSetter.set(expr); }};

// defines that after a successful parsing, the setResult command will be executed.

return

TOKEN(Int(), setResult);

}};}

A SetCmd object provides a method set that receives the value to be set as the result. The type parameter of SetCmd defines the type to be set. Since the rule method KEYWORD_Int is expected to return an expression, the type of the parameter is SetCmd<Expression>. The object expr is the one that should be the result of the method (in case parsing is successful in this method). The object setResult passes (if executed) the return object to resultSetter. And finally, there is a method TOKEN that receives two parameters: the first one is the token to be read, the second one the Cmd that is executed when the rule method is successfully applied.

Instead of creating a setResult object by hand, the parser provides a method setResult (which requires a result setter and a value as parameters). Using this method, the previous method can be written in a slightly more compact way:

Rule KEYWORD_Int(final SetCmd<Expression> resultSetter) { return new Rule() { public PC pc() {

// defines that after a successful parsing, the resultSetter receives a new IntExpression.

return

TOKEN(Int(), setResult(resultSetter, new IntExpression()));

}};}

3.3.3 Passing Token Values to a Parse Tree Node

When we want to do a similar thing for the rule method IDENTIFIER(), we need to create a SetValue object on our own, because the value of the token to be parsed should be assigned to the identifier field of an IdentifierExpression object. We do that by instantiating an anonymous inner class of type SetCmd<String> that receives from the TOKEN rule a string (because the token reader JAVA_IDENTIFIER() receives a String value) .

Rule IDENTIFIER(final SetCmd<Expression> resultSetter) { return new Rule() { public PC pc() {

// Defines the command object that sets the token string to the return object and sets the result object
final SetCmd<String> setValue = new SetCmd<String>() { public void set(String value) { 

IdentifierExpression expr = new IdentifierExpression();

expr.identifier = value;

resultSetter.set(expr);}};


return

TOKEN(JAVA_IDENTIFIER(), setValue);
}};}

In the code above there is no additional Cmd object that sets the result object. Instead, we do that in one step within setValue: first, we bind the token string to the field identifier, second, we set the result.

3.3.4 Adding Parse Tree Nodes to Collections

A special situation is given for the rule method N_TIMES_BRACKET_EXPRESSION because each time a bracket expression appears it should be added to the collection bracketExpressions within NFOLDBracketExpression. We do that by defining a SetCmd that adds a passed value to the collection. Finally, we need to pass the return object. The method NFOLD can receive an additional Cmd object as parameter which will be invoked whenever the NFOLD rule is applicable.

Rule N_TIMES_BRACKET_EXPRESSION(final SetCmd<Expression> resultSetter) { return new Rule() { public PC pc() {

final NFOLDBracketExpression expr = new NFOLDBracketExpression();
final SetCmd<BracketExpression> addExpression = new SetCmd<BracketExpression>() {public void set(BracketExpression value) {
expr.bracketExpressions.add(value);}};

return

NFOLD(BRACKET_EXPRESSION(addExpression), setResult(resultSetter, expr));

}};}

It occurs quite often that an object should be added to a list via a set command. Because of that, Slope provides in the parser class a special method that constructs such a set command and receives a list to which a value should be added. Using this method, the code above can be written in a different way:

Rule N_TIMES_BRACKET_EXPRESSION(final SetCmd<Expression> resultSetter) { return new Rule() { public PC pc() {

final NFOLDBracketExpression expr = new NFOLDBracketExpression();

return

NFOLD(

BRACKET_EXPRESSION(addResultSetter(expr.bracketExpressions)),

setResult(resultSetter, expr));

}};}


3.3.5 Using set()

The rule method that's left is BRACKET_EXPRESSION. It receives the resultSetter, creates a BracketExpression, sets the expressions child and sets the result. A special situation here is, that AND does not support directly a Cmd object as parameter. Instead, we use the method set() defined in the parser that invokes the command object when the rule is applicable.

Rule BRACKET_EXPRESSION(final SetCmd<BracketExpression> resultSetter) { return new Rule() { public PC pc() {

final BracketExpression expr = new BracketExpression();


final SetCmd<Expression> setChild = new SetCmd<Expression>() { public void set(Expression value) {

expr.childExpression = value;}};


return

AND(TOKEN(OBRACKET()), EXPRESSION(setChild), TOKEN(CBRACKET()), set(setResult(resultSetter, expr)));

3.3.6 Finally...

The rule method EXPRESSION() now needs to be fixed. It receives an additional SetCmd parameter and simply passes it to it's sub rules.

Rule EXPRESSION(final SetCmd<Expression> result) { return new Rule() { public PC pc() { return
OR(KEYWORD_Int(result), IDENTIFIER(result), N_TIMES_BRACKET_EXPRESSION(result));
}};}

Now, we are done and need to test the parser.

3.3.7 Testing the Parse Tree

For testing the parser, we now need to pass a result setting object to the expression rule method. We do that by creating a class ResultObject which has an expression as field and creating a SetCmd that receives an expression and sets it to the field.

}};}

public class ResultObject {

public Expression expression;

}


public Expression check(String s) {

final ResultObject res = new ResultObject();

SetCmd<Expression> expr = new SetCmd<Expression>() {public void set(Expression value) {

res.expression = value;

}};

TestLangTokens tokens = new TestLangTokens();

Parsing p = new Parsing(tokens, s);

p.parseWithParseTreeConstruction(tokens.new TestLangParser().EXPRESSION(expr));

return res.expression;

}

To make the code above slightly simpler, Slope provides a default ResultObject implementation (with a corresponding resultSetter). Using such methods, the code above can be reduced to:

public Expression check(String s) {

final ResultObject<Expression> res = new ResultObject<Expression>();

TestLangTokens tokens = new TestLangTokens();

Parsing p = new Parsing(tokens, s);

p.parseWithParseTreeConstruction(tokens.new TestLangParser().EXPRESSION(res.setter));

return res.result;

}

Now, we can run some tests on the parse tree:

public void testParse() {

Expression ex = check("Int");

assertTrue(ex instanceof IntExpression);

ex = check("(Int)");

assertTrue( ex instanceof NFOLDBracketExpression);

NFOLDBracketExpression nbex = (NFOLDBracketExpression) ex;

assertEquals(1, nbex.bracketExpressions.size());

assertTrue(nbex.bracketExpressions.get(0) instanceof BracketExpression);

BracketExpression bex = (BracketExpression) nbex.bracketExpressions.get(0);

assertTrue(bex.childExpression instanceof IntExpression);

}

The code of this tutorial can be found under: slope.lib.toyexamples.TestLangParsing, respectively slope.lib.toyexamples.TestLangParsingWithParseTree.

4 Some final notes

4.1 How Slope works / problems with OR

Slope works in a greedy way: whenever from Slope's perspective a rule is fulfilled, it just assumes that this rule matches and Slope does not try any alternative. This is important to understand when a rule such as

    <something> -> ("x" "y") | ("x" "y" "z")

will be checked: if the first two letters are x and y, Slope will not try the second branch (because the first branch can be read from the token stream). To overcome such a problem, simply changing the ordering in the rule solves the problem:

<something> -> ("x" "y" "z") | ("x" "y")

4.2 Never assign values directly

For the parse tree construction, there is hardly any case where a value can be directly set without using a command object. Slope first constructs the internal parse tree and then walks over it in order to evaluate the command objects. I.e. setting an object directly will not wait until the parse tree is constructed.

5 Etc.

If you have any questions or comments, just leave a message or write a mail to me (stefan.hanenberg@gmail.com).