Next: , Previous: texinfo reflection, Up: Top


44 (text parse-lalr)

44.1 Overview

This file contains yet another LALR(1) parser generator written in Scheme. In contrast to other such parser generators, this one implements a more efficient algorithm for computing the lookahead sets. The algorithm is the same as used in Bison (GNU yacc) and is described in the following paper:

"Efficient Computation of LALR(1) Look-Ahead Set", F. DeRemer and T. Pennello, TOPLAS, vol. 4, no. 4, october 1982.

As a consequence, it is not written in a fully functional style. In fact, much of the code is a direct translation from C to Scheme of the Bison sources.

44.2 Defining a parser

The module (text parse-lalr) declares a macro called lalr-parser:

        (lalr-parser tokens rules ...)

This macro, when given appropriate arguments, generates an LALR(1) syntax analyzer. The macro accepts at least two arguments. The first is a list of symbols which represent the terminal symbols of the grammar. The remaining arguments are the grammar production rules.

44.3 Running the parser

The parser generated by the lalr-parser macro is a function that takes two parameters. The first parameter is a lexical analyzer while the second is an error procedure. The lexical analyzer is zero-argument function (a thunk) invoked each time the parser needs to look-ahead in the token stream. A token is usually a pair whose car is the symbol corresponding to the token (the same symbol as used in the grammar definition). The cdr of the pair is the semantic value associated with the token. For example, a string token would have the car set to 'string while the cdr is set to the string value "hello". Once the end of file is encountered, the lexical analyzer must always return the symbol '*eoi* each time it is invoked. The error procedure must be a function that accepts at least two parameters.

44.4 The grammar format

The grammar is specified by first giving the list of terminals and the list of non-terminal definitions. Each non-terminal definition is a list where the first element is the non-terminal and the other elements are the right-hand sides (lists of grammar symbols). In addition to this, each rhs can be followed by a semantic action. For example, consider the following (yacc) grammar for a very simple expression language:

       e : e '+' t
         | e '-' t
         | t
         ;
       t : t '*' f
         : t '/' f
         | f
         ;
       f : ID
         ;

The same grammar, written for the scheme parser generator, would look like this (with semantic actions)

     (define expr-parser
       (lalr-parser
        ; Terminal symbols
        (ID + - * /)
        ; Productions
        (e (e + t)    : (+ $1 $3)
           (e - t)    : (- $1 $3)
           (t)        : $1)
        (t (t * f)    : (* $1 $3)
           (t / f)    : (/ $1 $3)
           (f)        : $1)
        (f (ID)       : $1)))

In semantic actions, the symbol $n refers to the synthesized attribute value of the nth symbol in the production. The value associated with the non-terminal on the left is the result of evaluating the semantic action (it defaults to #f). The above grammar implicitly handles operator precedences. It is also possible to explicitly assign precedences and associativity to terminal symbols and productions a la Yacc. Here is a modified (and augmented) version of the grammar:

     (define expr-parser
      (lalr-parser
       ; Terminal symbols
       (ID
        (left: + -)
        (left: * /)
        (nonassoc: uminus))
       (e (e + e)              : (+ $1 $3)
          (e - e)              : (- $1 $3)
          (e * e)              : (* $1 $3)
          (e / e)              : (/ $1 $3)
          (- e (prec: uminus)) : (- $2)
          (ID)                 : $1)))

The left: directive is used to specify a set of left-associative operators of the same precedence level, the right: directive for right-associative operators, and nonassoc: for operators that are not associative. Note the use of the (apparently) useless terminal uminus. It is only defined in order to assign to the penultimate rule a precedence level higher than that of * and /. The prec: directive can only appear as the last element of a rule. Finally, note that precedence levels are incremented from left to right, i.e. the precedence level of + and - is less than the precedence level of * and / since the formers appear first in the list of terminal symbols (token definitions).

44.5 A final note on conflict resolution

Conflicts in the grammar are handled in a conventional way. In the absence of precedence directives, Shift/Reduce conflicts are resolved by shifting, and Reduce/Reduce conflicts are resolved by choosing the rule listed first in the grammar definition. You can print the states of the generated parser by evaluating (print-states). The format of the output is similar to the one produced by bison when given the -v command-line option.

44.6 Usage

— Special Form: lalr-parser tokens . rules

The grammar declaration special form. tokens is the list of token symbols, and rules are the grammar rules. See the module documentation for more details.

— Function: print-states

Print the states of a generated parser.