In this tutorial, we will use the classic example of a calculator to show how JParsec's declarative API can be used to create a parser.
Our calculator should be able to:
- parse and evaluate decimal numbers;
- handle the binary operators +, -, * and / with the usual precedence and left associativity, as well as the unary negation operator -;
- handle sub-expressions grouped by parentheses;
- skip white space, single-line comments and block comments.
A typical parsing process takes two steps:
1. A lexer (scanner) reads the input character sequence and translates it into a sequence of tokens.
2. A parser reads the tokens and interprets them according to the grammar rules.
First, let's create our lexer. The lexer should be able to recognize decimal numbers, parentheses and the operators. The input character sequence will be recognized and translated into tokens.
To recognize a decimal number, we could use a regular expression; JParsec has support for regular-expression-based scanners. We will use the plain JParsec API here, however, to present a more intuitive approach.
The jfun.parsec.pattern.Pattern and jfun.parsec.pattern.Patterns classes can represent arbitrarily sophisticated patterns of input characters.
The following code represents the pattern for an integer:
final Pattern digits = Patterns.range('0','9').many(1);
A decimal number is one or more digits, optionally followed by a '.' and one or more digits as the fractional part (so "3" and "3.14" match, but ".5" does not):
final Pattern fractional = Patterns.isChar('.').seq(digits);
final Pattern number = digits.seq(fractional.optional());
Finally, we use this Pattern object to create our Parser object:
final Parser s_number = Scanners.isPattern(number);
In fact, JParsec does have built-in support for frequently used lexers such as integers, string literals and decimal numbers; they can be found in the Lexers class. To better serve the tutorial purpose, however, we chose to write the lexers from scratch instead of using these pre-built functions.
A scanner Parser object only matches input. It knows the position and the length of the match, but it doesn't return any token.
We need to use a jfun.parsec.Tokenizer object to convert the matched sub-sequence to a token object:
final Parser l_number = Scanners.lexer(s_number, jfun.parsec.tokens.TokenDecimal.getTokenizer());
To create lexers for the operators, the helper class Words can be used:
final Words ops = Lexers.getOperators(new String[]{"+", "-", "*", "/", "(", ")"});
final Parser l_ops = ops.getLexer();
Now we are done with all the necessary tokens of the calculator. The only thing left is white space.
This calculator allows any number of white spaces before and after any number or operator.
Single-line comments and block comments are also allowed and treated as white space:
final Parser s_line_comment = Scanners.javaLineComment();
final Parser s_block_comment = Scanners.isBlockComment("/*", "*/");
final Parser s_whitespace = Scanners.isWhitespaces();
The calculator ignores any one of the above three, so:
final Parser s_delim = Parsers.sum(new Parser[]{ s_line_comment, s_block_comment, s_whitespace });
The lexer that recognizes and returns tokens is:
final Parser l_tok = Parsers.plus(l_ops, l_number);
The lexer object should greedily scan the entire input, ignoring the white space and comments, recognizing the tokens and generating an array of tokens. The Scanners.lexeme(...) function is the ideal helper function to create such a Parser object:
final Parser lexer = Scanners.lexeme(s_delim.many(), l_tok)
    .followedBy(Parsers.eof());
A Parser object can be character-level, in which case it is fed a sequence of characters, or token-level, in which case it is fed an array of tokens. The lexer is a character-level object; we need a token-level Parser object to finish the job. The token-level Parser object should recognize the different tokens and interpret them according to the grammar rules.
From the lexer, we get two different kinds of tokens: decimal numbers and operators. To convert a decimal number token to a Double object, we can use the helper class Terms:
final Parser p_number = Terms.decimalParser(new FromString(){
  public Object fromString(int from, int len, String s){
    return Double.valueOf(s);
  }
});
The lexer already recognizes the operators and converts them into special tokens. It is now time to apply grammar rules to these tokens.
Using the Words object created in the Lexer section, we could easily get the Parser objects for these operators:
final Parser plus_op = Parsers.token(ops.getToken("+"));
Similar code can be applied to all the other operators. To avoid code duplication, we can create a function:
private static Parser istoken(Words ops, String op){
  return Parsers.token(ops.getToken(op));
}
Thus, the plus operator becomes:
final Parser plus_op = istoken(ops, "+");
In a classic recursive-descent parser, operators with different precedences are handled by defining different production rules, for example:
term_expr ::= ...
muldiv_expr ::= muldiv_expr ('*'|'/') term_expr
expr ::= expr ('+'|'-') muldiv_expr
This solution can lead to messy production rules when the number of operators and precedence levels scales up.
It is more desirable to be able to specify the operator precedence declaratively.
Another drawback of recursive-descent parsers is left recursion. In order to preserve the left-associative nature of the operators (for example, 1-2-3 must be evaluated as (1-2)-3, not 1-(2-3)), the above production rules are defined left-recursively: the first symbol on the right-hand side of muldiv_expr is muldiv_expr itself. Such a left-recursive definition leads to infinite recursion in a naive recursive-descent parser implementation.
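To see why, here is a minimal sketch in plain Java (not JParsec; readOperator and muldivExpr are hypothetical helpers) of a naive recursive-descent method for the rule expr ::= expr ('+'|'-') muldiv_expr:
class NaiveLeftRecursion {
  // expr() calls itself before consuming a single character,
  // so it recurses until the stack overflows.
  double expr() {
    final double left = expr();        // infinite recursion starts here
    final char op = readOperator();    // hypothetical: reads '+' or '-'
    final double right = muldivExpr(); // hypothetical: next precedence level
    return op == '+' ? left + right : left - right;
  }
  char readOperator() { return '+'; }  // placeholder body for this sketch
  double muldivExpr() { return 0; }    // placeholder body for this sketch
}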
JParsec provides support for operator-precedence grammars and automatically handles the left recursion incurred by left-associative operators.
Our target syntax should look like:
final OperatorTable optable = new OperatorTable()
  .infixl(p_plus, 10)
  .infixl(p_minus, 10)
  .infixl(p_mul, 20)
  .infixl(p_div, 20)
  .prefix(p_neg, 30);
...where all the operator precedences and associativities are specified declaratively.
There is some preparation to do, though. First, the infixl(...) method expects a Parser object that returns a jfun.parsec.Map2 object. The Map2 object is responsible for implementing the semantic action of a binary operator.
We can either create an abstract syntax tree object, or directly apply the operator to do the calculation.
Similarly, the prefix(...) method expects a Parser object that returns a jfun.parsec.Map object, which is responsible for implementing the semantic action of a unary operator.
We do the direct calculation in this example:
final Map2 plus = new Map2(){
  public Object map(Object o1, Object o2){
    final Number n1 = (Number)o1;
    final Number n2 = (Number)o2;
    return new Double(n1.doubleValue() + n2.doubleValue());
  }
};
This Map2 object takes the two operands and runs the calculation by adding the two double values together.
Since the downcasting and the creation of the Double object are the same for all binary operators, we can do a bit of refactoring:
private interface Operator{
  double cal(double a, double b);
}
private static Map2 toMap2(final Operator op){
  return new Map2(){
    public Object map(Object o1, Object o2){
      final Number i1 = (Number)o1;
      final Number i2 = (Number)o2;
      return new Double(op.cal(i1.doubleValue(), i2.doubleValue()));
    }
  };
}
private static Parser getOperator2(Words ops, String op, Operator v){
  // for creating a Parser for a binary operator.
  return getOperator(ops, op, toMap2(v));
}
private static Parser getOperator(Words ops, String op, Object v){
  return istoken(ops, op).seq(Parsers.retn(v));
}

final Operator plus = new Operator(){
  public double cal(double n1, double n2){ return n1+n2; }
};
final Operator minus = new Operator(){
  public double cal(double n1, double n2){ return n1-n2; }
};
...
final Parser p_plus = getOperator2(ops, "+", plus);
final Parser p_minus = getOperator2(ops, "-", minus);
Now, with the OperatorTable object we created earlier, we can use the Expressions helper class to build a parser that handles the operators correctly:
final Parser p_expr = Expressions.buildExpressionParser(p_number, optable);
...where p_number is the parser that parses a decimal number into a Double object, and optable is the OperatorTable object that declares all the operator precedences and associativities.
With the lexer and the parser objects ready, we can execute them to calculate real expressions.
First, we need to concatenate the lexer and the parser objects so that the parser object will take as input the output of the lexer object:
final Parser parser = Parsers.parseTokens(lexer, p_expr.followedBy(Parsers.eof()), "calculator");
The parser object can then be used to actually parse and interpret an expression:
final String src = "1+1*2-10";
final Object result = Parsers.runParser(src, parser, 1, "test1");
System.out.println(src + " = " + result);
Now, if we run the above code, we will get:
1+1*2-10 = -7.0
Multiplication binds tighter than addition and subtraction, so the expression is evaluated as (1 + (1*2)) - 10 = -7.
The Parser object works, except there is one thing missing: parentheses. We want the parser to be able to parse sub-expressions grouped by parentheses.
The production rule is:
op ::= '+' | '-' | '*' | '/'
term ::= number | '(' expr ')'
expr ::= expr op term
It is impossible to represent such a mutually recursive production with declarations of local Parser variables alone. Lazy evaluation is necessary in this case:
final Parser[] lazy_expr = new Parser[1];
final Parser p_lazy_expr = Parsers.lazy(new ParserEval(){
  public Parser eval(){ return lazy_expr[0]; }
});
final Parser p_term = Parsers.plus(
  Parsers.between(p_lparen, p_rparen, p_lazy_expr),
  p_number
);
Thus, the expression parser can be changed to:
final Parser p_expr = Expressions.buildExpressionParser(p_term, optable);
The final parser code is as follows:
import jfun.parsec.*;
import jfun.parsec.pattern.Pattern;
import jfun.parsec.pattern.Patterns;
import jfun.parsec.tokens.TokenDecimal;

public class Calculator {
  private interface Operator{
    double cal(double a, double b);
  }
  private static Map2 toMap2(final Operator op){
    return new Map2(){
      public Object map(Object o1, Object o2){
        final Number i1 = (Number)o1;
        final Number i2 = (Number)o2;
        return new Double(op.cal(i1.doubleValue(), i2.doubleValue()));
      }
    };
  }
  public Parser getParser(){
    final Pattern digits = Patterns.range('0','9').many(1);
    final Pattern number = digits.seq(Patterns.isChar('.').seq(digits).optional());
    final Parser s_number = Scanners.isPattern("number", number, "number expected");
    final Parser s_line_comment = Scanners.javaLineComment();
    final Parser s_block_comment = Scanners.isBlockComment("/*", "*/");
    final Parser s_whitespace = Scanners.isWhitespaces();
    final Parser l_number = Scanners.lexer(s_number, TokenDecimal.getTokenizer());
    final Words ops = Lexers.getOperators(new String[]{"+", "-", "*", "/", "(", ")"});
    final Parser s_delim = Parsers.sum(new Parser[]{
      s_line_comment, s_block_comment, s_whitespace
    }).many();
    final Parser l_tok = Parsers.plus(ops.getLexer(), l_number);
    final Parser lexer = Scanners.lexeme(s_delim, l_tok)
      .followedBy(Parsers.eof());
    final Operator plus = new Operator(){
      public double cal(double n1, double n2){ return n1+n2; }
    };
    final Operator minus = new Operator(){
      public double cal(double n1, double n2){ return n1-n2; }
    };
    final Operator mul = new Operator(){
      public double cal(double n1, double n2){ return n1*n2; }
    };
    final Operator div = new Operator(){
      public double cal(double n1, double n2){ return n1/n2; }
    };
    final Map neg = new Map(){
      public Object map(Object o){
        // negate the operand of the unary '-' operator
        return new Double(-((Number)o).doubleValue());
      }
    };
    final Parser p_plus = getOperator2(ops, "+", plus);
    final Parser p_minus = getOperator2(ops, "-", minus);
    final Parser p_mul = getOperator2(ops, "*", mul);
    final Parser p_div = getOperator2(ops, "/", div);
    final Parser p_neg = getOperator(ops, "-", neg);
    final Parser p_lparen = istoken(ops, "(");
    final Parser p_rparen = istoken(ops, ")");
    final Parser p_number = Terms.decimalParser(new FromString(){
      public Object fromString(int from, int len, String s){
        return Double.valueOf(s);
      }
    });
    final Parser[] lazy_expr = new Parser[1];
    final Parser p_lazy_expr = Parsers.lazy(new ParserEval(){
      public Parser eval(){ return lazy_expr[0]; }
    });
    final Parser p_term = Parsers.plus(
      Parsers.between(p_lparen, p_rparen, p_lazy_expr),
      p_number
    );
    final OperatorTable optable = new OperatorTable()
      .infixl(p_plus, 10)
      .infixl(p_minus, 10)
      .infixl(p_mul, 20)
      .infixl(p_div, 20)
      .prefix(p_neg, 30);
    final Parser p_expr = Expressions.buildExpressionParser(p_term, optable);
    lazy_expr[0] = p_expr;
    return Parsers.parseTokens(lexer, p_expr.followedBy(Parsers.eof()), "calculator");
  }
  private static Parser getOperator2(Words ops, String op, Operator v){
    return getOperator(ops, op, toMap2(v));
  }
  private static Parser getOperator(Words ops, String op, Object v){
    return istoken(ops, op).seq(Parsers.retn(v));
  }
  private static Parser istoken(Words ops, String op){
    return Parsers.token(ops.getToken(op));
  }
}
A complete test case is:
import jfun.parsec.*;
import junit.framework.TestCase;

public class TestCalculator extends TestCase {
  public void test1(){
    final Parser p = new Calculator().getParser();
    final String src = "/*this is beginning**/(1+1+ 2.1)+3*(((2)-10))//this is end";
    final Object obj = Parsers.runParser(src, p, 1, "test1");
    assertEquals(obj, new Double(-19.9));
  }
}
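As a rough usage sketch outside of JUnit (assuming the Calculator class above is on the classpath; the class name CalculatorDemo and the module name "calculator-demo" passed to runParser are arbitrary):
import jfun.parsec.*;

public class CalculatorDemo {
  public static void main(String[] args) {
    // build the parser once and run it over an expression string
    final Parser p = new Calculator().getParser();
    final String src = "(1 + 2) * (10 - 4)"; // should evaluate to 18.0
    final Object result = Parsers.runParser(src, p, 1, "calculator-demo");
    System.out.println(src + " = " + result);
  }
}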
In this tutorial, we chose to run the calculation directly in the semantic actions.
We could defer this calculation and create an intermediate abstract syntax tree for further analysis and processing (such as type checking and code optimisation). That makes no real difference to the parser code: we would just replace the direct calculation code with code that creates abstract syntax tree nodes.
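For illustration, here is a hedged sketch of what that might look like; the Expr, Literal and BinaryExpr types below are hypothetical and not part of JParsec:
// A minimal AST for the calculator; evaluation (or any other analysis) walks the tree later.
interface Expr {
  double eval();
}
class Literal implements Expr {
  final double value;
  Literal(double value){ this.value = value; }
  public double eval(){ return value; }
}
class BinaryExpr implements Expr {
  final char op;
  final Expr left, right;
  BinaryExpr(char op, Expr left, Expr right){ this.op = op; this.left = left; this.right = right; }
  public double eval(){
    switch(op){
      case '+': return left.eval() + right.eval();
      case '-': return left.eval() - right.eval();
      case '*': return left.eval() * right.eval();
      default:  return left.eval() / right.eval();
    }
  }
}
The semantic action for "+" would then build a node instead of a Double:
final Map2 plus = new Map2(){
  public Object map(Object o1, Object o2){
    return new BinaryExpr('+', (Expr)o1, (Expr)o2);
  }
};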
This short tutorial cannot cover every detail of the JParsec API. There are many other useful utility functions provided to help the process of constructing a sophisticated parser.
Please refer to the JParsec API Documentation for further information.
The frequently asked questions can be found here.
You are welcome to shoot me a note at ajoo.email@gmail.com should you have any comments or questions.