Jacc XML-Serialization Annotation

Hassan Aït-Kaci
Last modified on Tue Mar 13 16:43:53 2007 by hak

These notes describe a simple annotation system meant to enable the current implementation of Jacc to support the generation of an XML-serializer for a language from its grammar. Such a tool's purpose is to produce an XML tree representation of a parsed program (i.e., a correctly parsed element of the language generated by a Jacc grammar) by exploiting simple optional annotations that may be added to appropriate rules.


What is currently implemented

This explains informally how to us a simple annotation device in the current system (i.e. at the time of this writing) to produce a an XML serialization of the AST.

The basic annotation system

Essentially, the rule format of Jacc is that of Yacc. As in Yacc, Jacc rules may be annotated with semantic actions, in the form of code involving the rule's RHS constituents (denoted by $1, $2, ..., $n - the so-called pseudo-variables - where the index n in $n refers to the position of a constituent of the rule's RHS). Such actions appear between curly braces ('{' and '}') wherever a symbol may appear in a rule's RHS.

Jacc also allows an additional form of annotation in the RHS of a rule to indicate the XML serialization pattern of the abstract syntactic tree (AST) node corresponding to a derivation with this rule. This XML serialization meta-annotation comes between square brackets ('[' and ']').

The most basic form of XML annotation is of the form:

[ XmlTag n_1 ... n_k ]
where XmlTag is an identifier to use as XML tag for this node in the XML serialization of the AST, and the n_k's are a sequence of numbers denoting positions of symbols in the RHS of the rule (as for pseudo-variables but without '$'). The number sequence indicates the order in which subnodes are to be serialized.

For example, the annotated rule:

QUANTIF
   : 'Exists' Var_plus '(' CONDIT ')'
     [ Exists 2 4 ]
   ;
means that an AST node for this rule will be serialized thus:
 <Exists>
   (Xml serialization of Var_plus)
   (Xml serialization of CONDIT)
 </Exists>
Rules without XML serialization annotation follow a default behavior: the serialization is the concatenation of those of its RHS's constituents, eliminating punctuation tokens (i.e., empty nodes and literal tokens - namely, tokens that do not carry a value).

Using the above basic form and trusting the default bevavior for generating well-bracketted tag pairs is sufficient for simple serialization. However, whenever attributes are needed, or when the serialized AST form is not strictly homomorphic to the concrete syntax tree's, one needs more complex forms of annotation.


How it works

What the current system can do


What the curren system can't do

The basic system describe above works fine for the simplest case as illustrated above. It is, however, not sufficient if: We address these issues next.


How to extend what is currently implemented

Basic problem

We need to transform the "full" Concrete Syntax Tree (CST) generated by Jacc (i.e., in FULL_TREE mode) into an Abstract Syntax Tree AST represented as a W3C DOM-compliant XML tree.

The current version of the implementation supports only the basic scheme and does not actually ever build an XML tree. It simply builds a subtree of the CST following the annotation pattern at each rule as it is reduced. It then outputs each node as an XML element using the tag recorded in each parse node recursively.

Basic idea

There are two ways to go about this:
  1. Build the XML tree in one pass incrementally as the CST is built.
  2. Proceed in a two phase fashion: first build the full CST (i.e., wherein the complete available parse information is stored), then in a second pass, generate the XML tree from it (which then may be processed with any standard tools and serialized at will).

The first approach is sufficient if all the information needed for the synthesis of an XML element is available in the partial subtree available as the CST is being built in a bottom-up manner. However, it is not sufficient if information from siblings or ancestors is needed, since these are not existent at rule reduction time. (This is similar to the situation between synthesized and inherited attributes in attribute grammars.)

Since the annotation we will propose below for a node's XML representation relies exclusively on information from the CST subtree below a node and never on ancestors or siblings, we shall opt for the one-pass incremental bottom-up tree-building solution.

Jacc works from the leaves to the root (each CST tree node being a parse node - i.e., a ilog.language.syntax.ParseNode object). Thus, in order to build an XML AST from the CST, we proceed inductively from the parse nodes containing tokens, and work our way up at each rule reduction, building the xml tree as a field of the ilog.language.syntax.ParseNode standing for the reducing rule's LHS from the xml trees of its RHS.

In this way, we separate the parse node from its DOM compliant version. Hence, we will associate to each parse node a private instance variable _xmlNode of type org.dom4j.Document that will be set to a DOM compliant XML element representation of this parse node.

The annotation are per rule. The parse node corresponding to a rule's LHS symbol corresponds to the xml element specified by the annotation (if there is one) or the default value as per the default. Thus, the xml element of interior nodes may be built at rule reduction time.

For leaf nodes, however, the xml element is specified as the

Leaf nodes

These are parse nodes corresponding to tokens

Interior nodes

Topmost node