LR-Parsing with Dynamic Operators

Jacc's LR-Parsing with Dynamic Operators

This part of the Jacc documentation explains the modifications we can make to a basic table-driven LR parser generator à la yacc to accommodate support for Prolog's dynamic operators.

Prolog's operators

In Prolog, operators may be declared at runtime using the built-in predicate op/3. For example,

        op(500,yfx,+)

declares the symbol '+' to be an infix binary left-associative operator with binding looseness 500 (binding looseness is the opposite of precedence - anti-precedence, so to speak).

The second argument of op/3 is called the operator's specifier. It is a symbol that encodes three kinds of information concerning the operator:

arity (unary or binary),
fixity (prefix, infix, or postfix),
associativity (left-, right-, or non-associative).

The specifier is an identifier consisting of either two or three of the letters 'f', 'x', and 'y', which are interpreted as follows. The letter 'f' stands for the operator's position in an expression, and the letters 'x' and 'y' stand for the arguments' positions. The letters 'f', 'y', and 'x', are simply mnemonics for, respectively, "functor" (i.e., indicating the position of the operator symbol with respect to its argument(s)), "yes" to mean that the argument associates on this side of the operator, and "no" to mean that it does not associate on this side of the operator. In other words, a 'y' occurring on the left (resp., right) of 'f', means that the operator is left-associative (resp., right-associative), and an 'x' occurring on the left (resp., right) of 'f', means that the operator is not left-associative (resp., right-associative). Thus, the possible operator specifiers are:

Specifier Arity Fixity Associativity

fx unary prefix non-associative

fy unary prefix right-associative

xf unary postfix non-associative

yf unary postfix left-associative

xfx binary infix non-associative

xfy binary infix right-associative

yfx binary infix left-associative

Specifier	Arity	Fixity	Associativity
`fx`	unary	prefix	non-associative
`fy`	unary	prefix	right-associative
`xf`	unary	postfix	non-associative
`yf`	unary	postfix	left-associative
`xfx`	binary	infix	non-associative
`xfy`	binary	infix	right-associative
`yfx`	binary	infix	left-associative

Note that yfy is not allowed as an operator specifier because that would mean an ambiguous way of parsing the operator by associating either to the left or to the right.

The binding looseness used by Prolog's op/3 works in fact as the opposite of the precedence level used in parsing: the smaller a Prolog operator's binding looseness measure is, the more it takes precedence for parsing. These binding looseness measures range inclusively from 1 (maximum precedence) to 1200 (minimum precedence).

The third argument of op/3 can be any syntactically well-formed Prolog functor. In particular, these need not be known as operators prior to runtime. Prolog's tokenizer only recognizes such a token as a functor. Thus, any functor, whether declared operator or not, can always be parsed as a prefix operator preceding a parenthesized comma-separated sequence of arguments. Whether it is a declared operator determines how it may be parsed otherwise. In Sicstus Prolog, for example, declaring:

        ?- X = 1+2.

yields this query's answer:

        ?- X = 1+2 ? 

        X = 1 + 2.
        yes

and declaring:

        ?- X = +(1,2).

yields the same query's answer:

        ?- X = 1+2 ? 

        X = 1 + 2.
        yes

Prolog's parser can accommodate dynamic operators for two reasons:

The syntax of Prolog is completely uniform - there is only one syntactic construct: the first-order term. Even what appear to be punctuation symbols are in fact functors (e.g., ':-', ',', ';', etc...). Indeed, in Prolog everything is either a logical variable or a structure of the form f(t₁, ...,t_n).
Prolog's parser is an operator-precedence parser where precedence and associativity information is kept as a dynamic structure (see Dragon Book, Section 4.6, pp. 203-215).

Operator-precedence parsing is a bottom-up shift-reduce method that works simply by shifting over the input looking for a handle in a sentential form being built on the stack, and reducing when such a handle is recognized. A handle is the substring of a sentential form whose right end is the leftmost operator whose following operator has smaller precedence, and whose left end is the rightmost operator to the left of this right-end operator (inclusive), whose preceding operator has smaller precedence. This substring includes any nonterminals on either ends. For example, if '*' has higher precedence than '+', the handle in the string "Ex + Ex * Ex + Ex" is its substring "Ex * Ex".

Operator-precedence parsing is possible only for a restricted class of context-free grammars: the so-called operator grammars. A context-free grammar is an operator grammar if and only if no production's right-hand side is empty or contains two adjacent nonterminals.

For example, the two-rule context-free grammar where '(', ')', 'ID', '+', '*', '-', and '/' are terminal symbols:

        Ex -> Ex Op Ex | '(' Ex ')' | '-' Ex |  'ID'
        Op ->    '+'   |     '*'    |   '-'  |  '/'

is not an operator grammar. But the equivalent one-rule context-free grammar:

        Ex -> Ex '+' Ex | Ex '*' Ex | Ex '-' Ex | Ex '/' Ex | '(' Ex ')' | '-' Ex | 'ID'

is. It is not difficult to see that a Prolog term can be recognized by an operator grammar. Namely,

        Term -> 'VAR' | 'FUN' | 'FUN' '(' Body ')' | 'FUN' Term | Term 'FUN' Term | Term 'FUN' | '(' Term ')'
        Body -> Term  | Term ',' Body

where 'VAR', 'FUN', '(', ')', ',', are terminal symbols; in particular, 'FUN', is dynamically tokenized setting the terminal 'FUN' string value to the functor the tokenizer recognized it to be — and can thus easily accommodate dynamic operators.

Adapting LR Parsing

In an LR parser such as one generated by yacc, precedence and associativity information is no longer available at parse-time. It is used statically at parser generation-time to resolve potential conflicts in the parser's actions. Then, a fixed table of unambiguous actions is passed to drive the parser, which therefore always knows what to do in a given state for a given input token.

Thus, although they can recognize a much larger class of context-free languages, conventional shift-reduce parsers for LR grammars cannot accommodate parse-time ambiguity resolution. Although this makes parsing more efficient, it also forbids a parser generated by a yacc-like parser generator to support Prolog style operators.

In what follows, we propose to reorganize the structure of the implementation of a yacc-style parser generator to accommodate Prolog-style dynamic operators. We do so:

increasing the user's convenience to define and use new syntax dynamically without changing the parser;
adding new features while preserving the original yacc metasyntax;
retaining the same efficiency as yacc-parsing for grammars which do not use dynamic operators;
augmenting the recognizing power of bottom-up LALR parsing to languages that support dynamically (re)definable operators;
making full use of the object-oriented capabilities of Java to allow the grammar specifier to tune the parser generation using user-defined classes and attributes.

Declaring dynamic operators

The first issue pertains to the way we may specify how dynamic operators are connected with the grammar's production rules. The command:

        %dynamic op

is used to declare that the parser of the grammar being specified will allow defining, or redefining, dynamic operators of category op. The effect of this declaration is to create a non-terminal symbol named op that stands for this token category. Three implicit grammar rules are then automatically defined:

        op : op_ | _op_ | _op ;

which introduce, respectively, prefix, infix, and postfix, subcategories for operators of category op. These are terminal symbols standing as generic tokens that denote specific operators for each fixity. They are always an uppercase version of the dynamic category symbol, with underscores attached on each side as a mnemonic way of showing where their arguments are expected.

Once a dynamic operator category has ben declared such as above, specific operators of this category may be defined in the grammar specicification as follows:

        %op <operator> <specifier> <looseness>

or, equivalently:

        %op <operator> <looseness> <specifier>

For example,

        %op '+' yfx 500

or, equivalently,

        %op '+' 500 yfx

declares the symbol '+' to be an infix binary left-associative operator of category op, with binding looseness 500, as in Prolog.

In addition, the generated parser defines the following method:

        public final void op (String operator, String specifier, int looseness)

whose effect is to define, or redefine, an operator for the token category op dynamically using the given (Prolog-style) specifier and (Prolog-style) binding looseness. It is this method that can be invoked in a parser's semantic action at parse time.

An operator's category name may be used in a grammar specification wherever an operator of that category is expected. However, one may limit occurrences of such operators to those of specific fixity only using the terminal symbols:

op_ for a prefix operator of category op;
_op for a postfix operator of category op;
_op_ for an infix operator of category op.

For example, rules for operations on expressions can be specified thus:

        Expression : op_ Expression
                   | Expression _op
                   | Expression _op_ Expression
                   ;

A major modification in the parser generator and the generic parser must also be made regarding the parser's actions. A state may have contending actions on a given input. Such a state is deemed conflictual if and only if the input creating the conflict is a dynamic operator, or if one of its conflicting actions is a reduction with a rule with a tag that is a dynamic operator. (Recall that a rule's tag is the last terminal symbol occurring in its body, if one does; otherwise the rule has no tag.) All other states can be treated as usual, resolving potential conflicts using the conventional method based on precedence and associativity. Clearly, a dynamic operator token category does not have this information but delegates it to the specific token, which will be known only at parse time. At parser-construction time, a pseudo-action is generated for conflictual states which delays decision until parse time. It uses the state's table associating a set of actions with the token creating the conflict in this state. These sets of conflicting actions are thus recorded for each conflictual state.

When a token is identified and the current state is a conflictual state, which action to perform is determined by choosing in the action set associated to the state according to the same disambiguation rules followed by the static table construction but using the current precedence and associativity values of the specific operator being read. If a reduce action in the set involves a rule tagged with a dynamic operator, which precedence and associativity values to use for the rule are those of the specific operator tag for that rule, which can be obtained in the current stack. The stack offset of that operator will depend on which of the dynamic operator's rules is being considered.

Ambiguous tokens

Note that in general, the tokenizer may return a set of possible tokens for a single operator. Consider for example the following grammar:

        %token '!'
        %dynamic op1
        %op1 '!' yf 200
        %dynamic op2
        %op2 '!' yfx 500
        %%
        Expression : '!' Expression
                   | Expression _op1
                   | Expression _op2_ Expression
                   ;
        %%

For this grammar, the character '!' may be tokenized as either '!', 'op1', or 'op2'.

The tokenizer can therefore be made to dispense with guaranteeing a token's lexical category. Looking up its token category tables, the parser then determines the set of admissible lexical categories for this token in the current state (i.e., those for which it has an action defined). If more than one token remain in the set, a choice point for this state is created. Such a choice point records the current state of parsing for backtracking purposes. Namely, the grammar state, and the token set. The tokens are then tried in the order of the set, and upon error, backtracking resets the parser at the latest choice point deprived of the token that was chosen for it.

Note that the use of backtracking for token identification is not a guarantee of complete recovery. First, full backtracking is generally not a feasible nor desirable option as it would entail possibly keeping an entire input stream in memory as the buffer grows. The option is to keep only a fixed-size buffer and flush from the choice point stack any choice point that becomes stale when this buffer overflows. In effect, this enforces an automatic commit whenever a token choice is not invalidated within the time it takes to read further tokens as allowed by the buffer size.

Second, although backtracking restores the parser's state, it does not automatically undo the side effects that may have been performed by the execution of any semantic action encountered between the failure state and the restored state. If there are any, these must be undone manually. We thus let the grammar formalism allow to specify undo actions to perform when a rule is backtracked over.

The only limitation -- shallow backtracking -- is not serious, and in fact the choice-point stack's size can be specified arbitrarily large if need be. However, any input that overruns the choice-point stack's default depth is in fact cleaning up space by getting rid of older and less likely to be used choice-points. Indeed, failure occurs generally shortly after a wrong choice has been made. We give separately a more detailed specification of the implementation of the shallow backtracking scheme that is adequate for this purpose.

Token declarations

Tokens are declared in yacc with the commands %token, %right, %left, and %nonassoc. These commands also give the tokens they define a precedence level according to the order of declarations, tokens of equal precedence being declared in the same command. Since we wish to preserve compatibility with yacc's notations and conventions, we keep these commands to have the same effect. Therefore, these commands are used as usual to declare static tokens. However, we must explicate how the implicit precedence level of static token declarations may coexist with the explicit precedence information specified by the Prolog-like dynamic operator declarations.

We also wish to preserve compatibility with Prolog's conventions. Recall that the number argument in a Prolog 'op'/3' declaration denotes the binding looseness of the operator, which is inversely related to parsing precedence. The range of these numbers is the interval [1,1200]. To make this compatible with the foregoing yacc commands, the Grammar class defines two constants:

        static final int MIN_PRECEDENCE = 1;
        static final int MAX_PRECEDENCE = 1200;

In order to have the binding looseness to be such that 1200 corresponds to minimum precedence and 1 to maximum precedence, we simply define the precedence level of binding looseness n to be 1200-n+1. Thus, a declaration such as:

        %op '+' yfx 500

associates to binary '+' a precedence level of 701 (i.e., 1200-500+1).

More generaly, dynamic operators can be declared with the form:

        %op <operator> <specifier> <precedence>

or:

        %op <operator> <precedence> <specifier>

or the simpler form:

        %op <operator> <specifier>

leaving the precedence implicit, and defaulting to the precedence level effective at the command's execution time.

The first encountered token declaration with implicit precedence (i.e., a conventional yacc token command or a two-argument dynamic operator command) uses the initial precedence level set to Grammar.MIN_PRECEDENCE, then increments it by a fixed increment. This increment is 10 by default, but using the command:

 %precstep <number>

sets the increment to the given number. This command may be used several times. Each subsequent declaration with implicit precedence uses the current precedence level, then increments the precedence level by the current precedence increment. Any attempt to set a precedence level outside the [1,1200] range is ignored: the closest bound is used instead (i.e., 1 if less and 1200 if more), and a warning is issued.

Author:: Hassan Aït-Kaci
Version:: Last modified on Mon Dec 16 20:00:04 2019 by hak