XmlAnnotationSpecification.grm

// FILE. . . . . /home/hak/hlt/src/hlt/language/syntax/xml/XmlAnnotationSpecification.grm
// EDIT BY . . . Hassan Ait-Kaci
// ON MACHINE. . 4j4zn71
// STARTED ON. . Sat Apr 07 15:27:08 2007

// Last modified on Sun May 27 06:24:58 2018 by hak



[See also the Java sources]

We need a means to annotate a Jacc grammar so as to ease and automate the process of specifying an XML serialization for the language defined by the grammar. The way we proceed is by annotating some rules and terminals to produce an XML form built out of those XML forms built for the constituents of the CST (i.e., from a terminal's contents or a rule's RHS).

This specifies a grammar for a simple annotation language meant to enable passing XML formatting information from a Jacc grammar to a Jacc parser. This defines the form of what goes between square bracketed arguments of the xmlinfo command annotating a terminal, or appearing in a rule being annotated for XML conversion (i.e., for serialization purposes). Doing this gives us great flexibility for extending or modifying the annotation meta-syntax simply by:

  1. modifying the Jacc grammar source file,
  2. running the jacc command on it to regenerate the XmlAnnotationParser java source, and
  3. recompiling.
Et voilà ! ...

Specification

Basic annotation notation

We first introduce the basic annotation notation for the very common case when the XML tree to be constructed from the CST is homomorphic to the CST in that it only needs information that is local to the CST node. We will extend this notation later when the tree construction is heteromorphic, needing information from below this node.

In order for the parser of the annotation notation to stay small and light-weight, as well as avoiding ambiguity and stay strictly within LALR(1) recognition power, we will adopt the following very simple keyword-driven syntax. For example:

  A0 : A1 A2 A3 A4 
     [
       nsprefix   : "foo"
       localname  : "Azero"
       attributes : {a = "bar blah", b="blech"}
       children   : (2, 3)
     ] 
     ;
  
means that the XMLification of an A0 node (created when the parser uses this grammar rule bottom-up) will look like:
  <foo:Azero a="bar blah" b="blech">
    xmlification of A2
    xmlification of A3
  </foo:Azero>
  
In such an annotation, appearing in any order, the notation "keyword : value" is such that keyword is an admissible keyword. An admissible keyword is such that there may be at most one occurrence per annotation of:
        nsprefix
        localname
        attributes
        children
  
Such an admissible keyword is followed by a value, which may be either an identifier, a single- or double-quoted string, or a list between curly braces {...}, or parentheses (...), the nature of this list's brackets and elements depending on the keyword. See the grammar rules for details.

N.B.: The annotation is meant to be light-weight - all these keywords may be abbreviated to any non-empty case-insensitive prefix of their full form, and some punctuation may be used interchangeably or simply omitted: e.g., the ":" separating keywords and values, the "," separating list elements, as well as unnecessary quotes, are all in fact optional. The following key/value separator symbols may be used: ":", "=", "->", "=>", or they may be simply omitted. Similarly, the following list separator symbols may be used: "," (comma), ";" (semicolon), or they may be simply omitted. See the associate tokenizer class: XmlAnnotationTokenizer.java for details.

For example, the above same annotation could equally be written as follows:

  A0 : A1 A2 A3 A4 
     [
       NS foo
       LO Azero
       AT {a -> 'bar blah';  b -> blech}
       CH (2 3)
     ]
     ;
  

More complex annotation notation

The simple notation above is all one needs in many common cases: it works whenever the XML serialization pattern is constructible only from the immediate constituents of the rule's LHS (A0) - i.e., the XML trees of the rule's RHS symbols (n>0). It is, however, insufficient for expressing XML serialization patterns that depend on sub-elements contained within those of the XML serialization of the RHS symbols. The simple case is called homomorphic tree transduction, while the more complex case is called heteromorphic tree transduction. [NB: "homo-morphic" = "of similar form" (from the Greek "homo-morfos", meaning "same shape"), and "hetero-morphic" = "of dissimilar form" (from the Greek "hetero-morfos", meaning "different shape").]

A more elaborate XML annotation notation extends the above basic notation by allowing the values of attributes and children in the annotation to take on more complex forms denoting a reference to the desired XML constructs within the XML trees already built for the CST children of this node. Following are some simple color-coded examples illustrating the meaning of these annotations, showing how the basic notation for homomorphic tree-transduction annotations for attribute and children is extended to express heteromorphic tree-transduction as well.

Children annotation:

  • Basic children annotation notation:

      CH:(2, 4)
      
    specifies that the XML children are, in this order:

    1. XML form of 2nd child CST,
    2. XML form of 4th child CST.

  • Extended children annotation notation:

    • Granchildren notation:
        CH:(2[1], 4[2])
        
      specifies that the XML children are, in this order:

      1. 1st XML component of the XML form of 2nd child CST,
      2. 2nd XML component of the XML form of 4th child CST.

    • Descendants notation:
        CH:(2[1.4], 1[2.1.3])
      specifies that the XML children are, in this order:

      1. 4th XML component of 1st XML component of the XML form of 2nd child CST,
      2. 3rd XML component of 1st XML component of 2nd XML component of the XML form of 1st child CST.
    • Attribute reference notation:
        CH:(2[1.4]/foo)
      specifies that the only XML child is:

      1. the string value of the attribute named foo of the 4th XML component of 1st XML component of the XML form of 2nd child CST.
    • Wrapper notation:
        CH:(foo.2, bar.fuz.4)
        
      specifies that the XML children are, in this order:

      1. <foo> XML form of 2nd child CST </foo>
      2. <bar> <fuz> XML form of 4th child CST </fuz> </bar>

      N.B.: By default, wrappers do not distribute over their contents. In other words, the resulting form will be one with all the contents wrapped in a single nesting of wrappers. If it is desired to override this default behavior and actually distribute a wrapper tag path over the sequence making up the contents being wrapped, then one uses an asterisk ('*') instead of a dot ('.'), as in, e.g.:

        CH:(foo*2, bar*fuz.4)

      Thus, using an asterisk rather than a dot in specifying a wrapper path triggers one of three things depending on whether the contents being wrapped is:

        1. nothing: nothing;
        2. a single XML element: the wrapped single element;
        3. a sequence of XML elements: the sequence of wrapped elements.

  • Summary of children notation

    The full form of the annotation expression for specifying children is:
    CH:(... w1 ... wnc[x1. ... .xm]/a ...)
    of which only c is mandatory. The four parts of a child specification expression are:

    1. The wrapper path: w1 ... wn is optional: each wrapper wi is a pair made of a (unquoted, single-quoted, or double-quoted) string (an XML tag), followed by a distribution marker : either a dot ('.'), or an asterisk ('*'). Using a dot triggers single wrapping, while using an asterisk triggers distributive wrapping (at the nesting level specified).

    2. The child: there are two cases:

      1. in a rule's annotation, c is a positive integer and denotes a position in the rule's RHS (i.e., a position in the CST) and refers to the XML tree corresponding to the child CST at this position;

      2. c may not be a number but a special form. In this case, there may be nothing trailing after c; i.e., [x1. ... .xm]/a is empty.

    3. The XML tree path:[x1. ... .xm] is optional; if not empty, it denotes a path in the XML tree corresponding to referring CST child, each xj being a positive integer denoting a child position in the XML tree rooted in this referring CST child.

    4. The attribute reference: /a is optional; when present, a is a (possibly unquoted, single-quoted, or double-quoted) string; it must be the name of an XML attribute in the ultimate XML tree referred to by c[x1. ... .xm], and denotes the string content making up the value of that XML attribute.

Attributes annotation:

  • Basic attribute annotation notation:

      AT:{foo="bar"}
    sets the attribute named foo to the literal string value "bar".

  • Extended attribute notation:

    • Child's text value:
        AT:{foo=3}
      sets the attribute named foo to the text value of the XML form of 3rd child CST.

    • Child's attribute value:
        AT:{foo=3/bar}
      sets the attribute named foo to the value of the attribute named bar in the XML form of 3rd child CST.

    • Descendant's text value:
        AT:{foo=3[1.2]}
      sets the attribute named foo to the text value of the 2nd XML component of the 1st XML component of the XML form of the 3rd child CST.
    • Descendant's attribute value:
        AT:{foo=3[1.2]/bar}
      sets the attribute named foo to the value of the attribute named bar located in the 2nd XML component of the 1st XML component of the XML form of the 3rd child CST.
    • Terminal value:
      In a terminal's annotation only:
        AT:{foo=$VALUE}
      sets the attribute named foo to the value of the terminal node actually parsed.

  • Summary of attribute notation

    The full form of the annotation expression for specifying an attribute is:
    AT:{... foo=c[x1. ... .xm]/a ...}
    of which only c is mandatory.

    • If [x1. ... .xm]/a is missing, then c may be only one of:

      1. a literal string "bar"; or,

      2. a special form.

    • If [x1. ... .xm]/a is present, then the annotation is that of a rule and c must be a positive integer denoting the position a child CST for the current rule (a position in the rule's RHS). It refers to the XML tree of child CST at that position. If [x1. ... .xm] is present, the xi is a sequence of dot-separated positive integers, an XML tree path referencing an XML subtree; /a is required whenever c is a child position and must be the name of an attribute in the XML tree so-referenced. This annotation denotes the string value of this attribute in that XML tree.

Interpreted special forms:

In addition to the above notation (and default behavior), we provide the following conveniences to specify finer details on the XML appearance from the information present in the CST thanks to the following built-in special forms, which all starting with a dollar sign '$', followed by the (case-insensitive) form identifier and possible arguments between parentheses and separated by a legal list separator; namely, blank space, "," (comma), or ";" (semicolon).

  • Extracting the value of a terminal: $VALUE

    The notation $VALUE may appear in an XML annotation expression for either a rule or a terminal whenever the CST construct it refers to is that of a terminal. For example:

    • an attribute value string; e.g.:

           %xmlinfo TERMINAL   [ L:"BAR" N:"Foo" A:{ fuz = $VALUE } ]
      specifies that the terminal symbol TERMINAL whose print value is "Gloop" will be serialized thus:
          <Foo:BAR fuz="Gloop"/>
    • a single XML content string; e.g.:

           %xmlinfo TERMINAL   [ L:"BAR" N:"Foo" C:( $VALUE ) ]
      specifies that the terminal symbol TERMINAL whose print value is "Gloop" will be serialized thus:
      
           <Foo:BAR>Gloop</Foo:BAR>
  • Concatenating pieces of text: $TEXT

    Wherever text is expected, we may use the notation $TEXT(...) to denote the text string resulting from the concatenation of the text strings denoted by its arguments, each of which may be either a literal (possibly single- or double-quoted) string, or a reference to a text value deeper in a descendant CST's XML structure using the XML tree reference notation c[x1 ... xn]/a, where the [x1 ... xn] and /a parts are optional.

    This construct comes handy for composing a text string on the fly to make up the text value of a child or an attribute. For example, given the following annotations:

      %xmlinfo ID    [ L:"Identifier"  A:{ name  = $VALUE } ]
     
      %xmlinfo STR   [ L:"String"      A:{ value = $VALUE } ]
     
      Entry
        : STR '^^' EntryType
        [ L:"Place" A:{ type = $TEXT ( "[" 3/special "]" 3/general ) } C:( 1/value ) ]
        ;
     
      EntryType
        : ID ':' ID
        [ L:"Type" A:{ general = 1/name special = 3/name } ]
        ;
    the piece of Entry syntax:
      "bar"^^less:top
    gets serialized as:
      <Place type="[top]less">bar</Place>
    .

Checking Annotation Consistency

We need to enforce consistent number referencing in the tree addresses used in the notation - i.e., the numbers that refer to RHS nodes and XML elements (the ci's and the xi's below). Indeed, they should (be made to) obey the following necessary conditions (all easy to justify):

  • [Condition 1]

    An annotation for a terminal, or for a rule with an empty RHS, should not be allowed to use a tree address in any attribute specifier (only symbol, quoted string, or number). A terminal's CH may only contain wrappers and a reference to $VALUE

  • [Condition 2]

    In an annotation AT : {...}, the name of an attribute following an XML tree reference must be a legal attribute of the element so referenced.

  • [Condition 3]

    In a rule annotation CH : (... ci[...] ...), the number ci must be between 1 and the length of the rule's RHS.

  • [Condition 4]

    In an annotation CH : (...), two distinct occurrences of XML content references must not be allowed to be one another's prefix or duplicate address. In other words:

    • no tree address may occur more that once in the same annotation; and,

    • whenever a tree address of the form c[x1.x2. ... .xn] occurs, then neither c[x1.x2. ... .xn-1], nor c[x1.x2. ... .xn-2], ..., nor c[x1], nor c, may be allowed to occur in the same annotation.

    For example, both CH : (1 2) and CH : (1 2[1] 2[2]) are legal; but, both CH : (1 2 1) and CH : (1 2[1] 1[2]) are illegal.

  • [Condition 5]

    Note also that whenever a tree address of the form c[x1.x2. ... .xn] occurs, then for it to be consistent, this entails that the XML form of the CST node referenced by c must consist of exactly one XML element - as opposed to none or many. This is true iff the referenced RHS symbol is either a value-carrying terminal, or a non-terminal all of whose possible XML forms are each single XML elements. This must be verified statically at grammar analysis time.

Violation of any of these conditions at parser-generation time should raise an exception and be reported as an error.

If all these conditions hold, then the code for the method xmlify(Element container) defined in the class ParseNode, and the method createXmlForm(ParseNode node, Element root) defined in the class XmlInfo, is guaranteed to work safely.

DTD/Schema extraction

Note that when all annotations are consistent, we may wish to extract more static information from the annotated grammar. It is indeed possible to infer the global nature of the admissible XML documents generated from a specific annotated grammar at parser-generation time using simple static analysis of the grammar. From this we may then generate a DTD or an XML Schema describing the type of XML documents produced from serializing well-formed syntactic units. This may then be optionally adjoined to the produced XML document as a seal of verifiable well-formedness.

Extracting the types of XML elements from annotations

For verifying a property such as [Condition 2] above, it is necessary to know the XML "element type" of the referenced XML node. This "type", of the form nsPrefix:localName, may be computed statically by analyzing the grammar's annotations, and deriving the exact XML element "type" for each tree reference in the annotations. This is done as follows.

To each grammar symbol A, we associate its _xmlFormType: a RegularExpression denoting the set of possible XML element types that A may expand into when serialized into its XML form:

  • if A is an unannotated terminal that does not carry a value, then A's _xmlFormType is empty (i.e., RegularExpression.EMPTY);

  • if A is an annotated or value-carrying terminal, then A's _xmlFormType is a single XML element whose name is the symbol's name if non-annotated, or specified by the annotation, otherwise;

  • if A is a non-terminal whose ruleset is:
              A : A11 ... A1n1
                | A21 ... A2n2
                | ...
                | Am1 ... Amnm
                ;
              
    then A's _xmlFormType is the union, for each rule index i, of the Xi's, where each Xi is the _xmlFormType corresponding to the ith rule for A (for 0≤i≤m), and computed as follows:

    • if the ith rule for A is annotated, then Xi is the XML type of the single XML element specified by the annotation;

    • otherwise, Xi is the concatenation of those of the Aij's (for 0≤j≤ni).

Computing a DTD by fix-point closure of regular expressions

The idea is simply to iterate the above step until a fix point is reached (i.e., until no change occurs in any of the computed _xmlFormType's from one iteration to the next). Therefore, in order to enable effective comparison of RE's, the _xmlFormType's are kept in normal form.

Example

Consider the following annotated grammar:
  Sexp : Atom
       | '(' List ')' [ LO: List   CH: (2) ]
       ;
 
  Atom : NUMBER     [ LO: Number AT: {value = $VALUE} ]
       | NAME       [ LO: Name   AT: {value = $VALUE} ]
       | NIL        [ LO: Nil ]
       ;
 
  List :            [ LO: Nil ]
       | List Sexp
       ;
  




This file was generated on Tue Apr 30 07:44:50 CEST 2019 from file XmlAnnotationSpecification.grm
by the hlt.language.tools.Hilite Java tool written by Hassan Aït-Kaci