This file contains yet another LALR(1) parser generator written in Scheme. In contrast to other such parser generators, this one implements a more efficient algorithm for computing the lookahead sets. The algorithm is the same as used in Bison (GNU yacc) and is described in the following paper:
"Efficient Computation of LALR(1) Look-Ahead Set", F. DeRemer and T. Pennello, TOPLAS, vol. 4, no. 4, october 1982.
As a consequence, it is not written in a fully functional style. In fact, much of the code is a direct translation from C to Scheme of the Bison sources.
The module (text parse-lalr)
declares a macro called lalr-parser
:
(lalr-parser tokens rules ...)
This macro, when given appropriate arguments, generates an LALR(1) syntax analyzer. The macro accepts at least two arguments. The first is a list of symbols which represent the terminal symbols of the grammar. The remaining arguments are the grammar production rules.
The parser generated by the lalr-parser
macro is a function that takes
two parameters. The first parameter is a lexical analyzer while the second is an
error procedure. The lexical analyzer is zero-argument function (a thunk)
invoked each time the parser needs to look-ahead in the token stream. A token is
usually a pair whose car
is the symbol corresponding to the token (the
same symbol as used in the grammar definition). The cdr
of the pair is
the semantic value associated with the token. For example, a string token would
have the car
set to 'string
while the cdr
is set to the
string value "hello"
. Once the end of file is encountered, the lexical
analyzer must always return the symbol '*eoi*
each time it is invoked.
The error procedure must be a function that accepts at least two parameters.
The grammar is specified by first giving the list of terminals and the list of non-terminal definitions. Each non-terminal definition is a list where the first element is the non-terminal and the other elements are the right-hand sides (lists of grammar symbols). In addition to this, each rhs can be followed by a semantic action. For example, consider the following (yacc) grammar for a very simple expression language:
e : e '+' t | e '-' t | t ; t : t '*' f : t '/' f | f ; f : ID ;
The same grammar, written for the scheme parser generator, would look like this (with semantic actions)
(define expr-parser (lalr-parser ; Terminal symbols (ID + - * /) ; Productions (e (e + t) : (+ $1 $3) (e - t) : (- $1 $3) (t) : $1) (t (t * f) : (* $1 $3) (t / f) : (/ $1 $3) (f) : $1) (f (ID) : $1)))
In semantic actions, the symbol $n
refers to the synthesized attribute
value of the nth symbol in the production. The value associated with the
non-terminal on the left is the result of evaluating the semantic action (it
defaults to #f
). The above grammar implicitly handles operator
precedences. It is also possible to explicitly assign precedences and
associativity to terminal symbols and productions a la Yacc. Here is a modified
(and augmented) version of the grammar:
(define expr-parser (lalr-parser ; Terminal symbols (ID (left: + -) (left: * /) (nonassoc: uminus)) (e (e + e) : (+ $1 $3) (e - e) : (- $1 $3) (e * e) : (* $1 $3) (e / e) : (/ $1 $3) (- e (prec: uminus)) : (- $2) (ID) : $1)))
The left:
directive is used to specify a set of left-associative
operators of the same precedence level, the right:
directive for
right-associative operators, and nonassoc:
for operators that are not
associative. Note the use of the (apparently) useless terminal uminus
. It
is only defined in order to assign to the penultimate rule a precedence level
higher than that of *
and /
. The prec:
directive can only
appear as the last element of a rule. Finally, note that precedence levels are
incremented from left to right, i.e. the precedence level of +
and
-
is less than the precedence level of *
and /
since the
formers appear first in the list of terminal symbols (token definitions).
Conflicts in the grammar are handled in a conventional way. In the absence of
precedence directives, Shift/Reduce conflicts are resolved by shifting, and
Reduce/Reduce conflicts are resolved by choosing the rule listed first in the
grammar definition. You can print the states of the generated parser by
evaluating (print-states)
. The format of the output is similar to the one
produced by bison when given the -v command-line option.