Resolving ambiguous tokens in LR-parsing

The Tokenizer interface specifies that its nextToken() method should return a ParseNode object. Such an object will generally have its symbol field set by the tokenizer's nextToken() method to hold the lexical category known to the parser (of type ParserTerminal) along with the specific value of the token in either the node's svalue or nvalue field. In the case where the token's lexical category cannot be determined by the tokenizer, then node's symbol field is not set (i.e., has value null).

Regulated tokenizing

Three global structures are used to regulate the tokenizing:

The tokenChoice stack:

This is a stack of finite size which contains choice point objects containing all the information needed to recover the parsing process at this point with the next admissible token value in the case where the one previously selected at this point led to an error state. It is implemented as an instance of a FiniteStack object, which is a stack of limited memory that remembers elements pushed only up to a finite history, and ``forgets'' anything beyond that.

The trailStack stack:

This is also a FiniteStack. Its elements are TrailEntry objects. A TrailEntry consists of three elements:

a handle: the RSH that was reduced;
a rule: the grammar rule used;
a time stamp: a unique OID.

Information in this stack is coordinated with that contained in the choice-point stack. Namely, there are only as many trail entries as there were rules applied since the current oldest choice-point was created.

Upon backtracking, the trail stack is unwound and all the rules that were applied since the last choice point are undone. That is, they are applied forward (top down), in the inverse order they were traversed when applied backward (bottom up). The parser's stack is recovered from the trail entries's handles, and after each forward application of a trailed rule, its undo semantic action is performed. All tokens read since the last choice points are recovered from the parser's stack rewinding and pushed in the structure decribed next.

The readStack stack:

This is a normal stack that acts as an "unreading" buffer for tokens that are popped off the stack by backtracking and need to be read again.

The sizes of the two first structures can be specified by the grammar specifier, otherwise are set by default values (which work fine generally for "normal" grammars).

Reading a token

A call to readToken() triggers the following chain of events.

If the readStack buffer is not empty, then readToken() uses the token indicated by the top of the readStack stack.
If readStack is exhausted, readToken() calls on the tokenizer to provide the next token node.
Whether the current token node has a lexical category or not, the readToken() method looks up the terminal and operator tables to determine all the admissible ParseNode values that this token may have in the current state. If only one such value can be found, then it is used as the value returned by the parser's readToken() method. Otherwise, a choice point is created and pushed on the tokenChoice stack.

Choice points

A choice point must contain at least the following information:

a stack of admissible tokens;
a time stamp (unique OID).

Note that pushing a new choice point on the tokenChoice stack may cause it to overflow. In this case again, the oldest choice point is discarded after dropping from the trailStack all entries older than the next oldest choice point in the tokenChoice stack.

Trail entries

A trail entry point must contain at least the following information:

the rule which was applied;
the handle that was reduced.
a time stamp (unique OID).

Again, pushing a new trail entry point on the trailStack stack may cause it to overflow. In this case again, the oldest trail entry is discarded after dropping from the tokenChoice stack all choice points older than the next oldest trail entry in the trailStack stack.

Backtracking

If an error is encountered and the tokenChoice stack is empty, then error recovery is attempted as normally done in LR-parsing.

If the tokenChoice stack is not empty, the admissible token stack of the current choice point is popped and used to set currentToken, and the trail stack is unwound, which causes the parser stack to be rewound to the stack that was in effect when this choice point was created. All undo actions are performed as the rules are unapplied.

If popping the admissible token stack of the current choice point leaves it empty, the choice point is discarded. Discarding a choice point means popping it from the tokenChoice stack and resetting the trail stack accordingly.

Author:: Hassan Aït-Kaci
Version:: Last modified on Sat Oct 05 19:15:58 2002 by hak