A General Specification of an Efficient Java Implementation
for Representing and Interpreting OSF Constraints
Hassan Aït-Kaci

HAK's Language Technology
All contents Copyright © by
Hassan Aït-Kaci—All Rights Reserved
Last modified on Thu Jan 31 15:41:49 2013 by hak
This is a high-level specification for implementing an operational
system in Java for Order-Sorter Feature (OSF) constraints. It describes
general syntactic aspects for defining a sort taxonomy and OSF terms
(also known as ψ-terms) using such a taxonomy. It discusses
considerations for an optimized operational semantics for the efficient
representation and interpretation of expressions using
ψ-terms. [N.B.: the expression "ψ-term" was
introduced originally by the author in his
PhD
thesis (Penn, 1984) as a shorthand for "property and
sort inheritance term," as a formalization of some
parts of Ron Brachman's "SI-Nets," or "Structured Inheritance
Networks" defined informally in the
latter's PhD
thesis (Harvard, 1977).]
In particular, we cover implementation details for encoding partially
ordered sorts using binary codes. We also describe the basic building
blocks of a general execution architecture using this set-up to implement
ψ-terms unification and generalisation. For general user convenience, we
also suggest useful pragmas than can come handy in a practical language
implementation for experimenting with OSF constraints.
Note that this is just an initial specification for guiding the Java
implementation of such a language whose workings are based on an
underlying concrete architecture such as the one we describe. One can
peruse the
initial set of source
files. source files is avail An actual first language prototype
testing some of the issues described in this specification is realized
using the HLT Jacc grammar meta-compiler and HLT language tools
implemented by the author. It is called "OSF V0" (for "Version
0")—see
the source
files.
To make some aspects more tangible, this specification will make
informal reference to OSF V0's specific syntax when discussing general
structural and operational issues. However, this specific syntax is not
binding for future more complete versions of OSF technology, and neither
are all the points discussed, as it is likely that variations and
departures from some details are bound to occur in future
implementations. Nevertheless, the overall structure is expected to
stand as described.
- Declaring a symbol partial ordering, where the symbols are sorts
that denote sets, and the order denotes an
"is-a" relation between these sorts (i.e., subset).
Such a basic declaration is of the form s1 <|
s2., which declares the sort s1
to be a subsort of the sort s2; the shorthand
notation s1, ..., sn <| s. is
equivalent to s1 <| s., ...,
sn <| s.; and similarly, s <| s1,
..., sn. is shorthand for s <|
s1., ..., s <| sn..
Possibly redundant or cyclic declarations are detected upon sort
encoding (see below). While redundant declarations are harmless and may
simply be reported as warnings, cyclic declarations must be flagged as
errors as they make no sense.
- The syntax of a ψ-term of the form: [#X:] s[(
[f1 ⇒] tn, ..., [fn ⇒]
tn )], where:
- A definition is of the
form $Id[(#X1,...,#Xn)] = t.,
where Id is an identifier, t is a ψ-term, and
the (optional) #Xi's arguments are variable tags
occurring in t. The name being defined
(i.e, $Id) may not appear in the ψ-term defining
it (i.e, t), nor in any other ψ-term preceding
this definition (see why). This facility is
only provided as a macro-like convenience and has no formal meaning
other than syntactic shorthand. An occurrence
of $Id(#Y1,...,#Yn) is just an
abbreviation for a new copy of the ψ-term it stands for
(i.e, t) in which #Yi is
substituted for #Xi in t, and all other
tags in t have been consistently renamed afresh. (By
"consistently renamed," we mean that all occurrences of a tag
must be renamed with a same tag.) If a defined name has no tag
arguments, then the parentheses are omitted, writing $Id
rather than $Id(). The specific lexical category of
type $Id is provided for such names not to be confused with
sorts, features, and tags.
- A dyadic ψ-term GLB (unification) operation e1
∧ e2, which denotes the intersection of the sets
denoted by each term; and a dyadic ψ-term LUB (generalisation)
operation e1 ∨ e2 that computes the
most specific ψ-term subsuming both its arguments (which denotes a
superset of the union of their set denotations). The
expressions e1 and
e2 are either explicit ψ-terms or (presumably
defined) names.
- A feature projection operation e/f, where e is a
ψ-term or a (presumably defined) name, and f is a feature
symbol. This denotes the ψ-term under that feature f of
that ψ-term, or @ if it does not possess this
feature.
- If the same feature symbol appears more than once as the argument of a
ψ-term, e.g., s(f ⇒ t, ... f ⇒ u), it is just
shorthand for s(f ⇒ t∧u, ...).
- As for the lexical categories:
- The topmost sort is represented by the symbol '@'; it
means "anything" as it denotes the set of all possible
things. The bottom sort is represented by the symbol '{}';
it means "nothing" as it denotes the empty set. They are both
built-in sort symbols. For obvious reasons, neither @
nor {} may appear in a
<| declaration.
- Other than @ and {}, a sort symbol may be any
identifier, or a number.
- Number sorts may not appear in a <| declaration. They
are implicit subsorts of the built-in sorts Integer and
FloatingPointNumber, which is are both subsorts of
Number, whose only super-sort is @. These three
sorts may not appear in a subsort declaration. [N.B.: this
may change in future versions when sort theories are supported as
some <| declarations involving these may make sense; for
example, EvenInteger <| Integer. In this case, the theory
for EvenInteger would specify that its instances must be
divisible by 2. However, we will disallow it in early versions such
as OSF V0 since it does not support sort theories. We will discuss
this more in detail later.]
- A feature symbol may be any identifier or a non-zero natural
number (i.e., an unsigned positive
integer). The nth missing feature in the argument
list of a ψ-term is simply the integer
n. For example foo(bar, boo ⇒ buz, fuz,
hum) is just shorthand for foo(1 ⇒ bar, boo ⇒
buz, 2 ⇒ fuz, 3 ⇒ hum).
- A variable symbol (or tag) starts with the symbol '#'
followed by a non-empty sequence of letter, numeral, or _
(i.e., underscore) characters.
- There are several simplified notations of the syntax of sorts and
ψ-terms. Here are some a few common such shorthands.
- If a ψ-term has an empty body (i.e., it has no
subterms), the parentheses for its body may be omitted
(i.e., #X : s() may be simply written #X :
s).
- A ψ-term may be just a tag #X followed by nothing, in
which case this is shorthand for #X : @.
- A disjunctive sort expression containing only one sort is just that
sort; i.e., {s} is simply s.
- An empty disjunctive sort expression is just the empty sort {}.
Sort Encoding
In this section, we cover in some detail practical aspects for the efficient
representation and use of multiple-inheritance in a large sort taxonomy,
and the machinery for encoding partially ordered sorts as binary codes
to
optimize lattice operations on them. We first give the details for
implementing a basic technique based on transitive closure. Then, we develop
data structures and methods for implementing a more elaborate technique
capable of handling very large taxonomies.
Transitive Closure Encoding
Note that no operation (either on sorts or ψ-terms) may be
performed before the complete set of sorts has been encoded. Sorts
are recorded from their appearances in declarations. However, other
undeclared sorts appearing in ψ-terms definitions must also be so
recorded. Therefore, any attempt to perform any operation needed
information on sorts will prompt the user on whether compilation
should proceed. (See also the %encode and
%automatic pragmas below.)
Once the sort taxonomy has been encoded, any new sort declaration extending
the taxonomy, or any definition involving a new sort symbol will prompt the
user for the need to reencode the sort taxonomy before evaluating any
expression—unless the taxonomy is explicitly
"de-encoded"—i.e. all sort codes are erased. (See also
the %clear
and %extend pragmas below.)
In such a language, the sort taxonomy consisting of all the sorts is
encoded into bit vectors. This does not apply to literal values (such as
numbers, strings, etc.) but only to symbolic sorts. All the symbolic
sorts are numbered in the order in which they appear in a program and stored
in an array called (say) TAXONOMY of objects of
class Sort at the index corresponding to their order of
appearance. [N.B.: In fact, to allow for dynamic growth if needed,
it will be an ArrayList rather than, strictly speaking, an
array. But this is just a detail and we use "array" in this specification
for simplicity.] In this TAXONOMY array, the built-in symbolic
sorts are initialized as follows:
- Integer is stored in TAXONOMY[0];
- FloatingPointNumber is stored in TAXONOMY[1].
- Number is stored in TAXONOMY[2];
The bottom and top sorts ({} and @) are not stored
in TAXONOMY. They are instead defined as constant sorts.
[N.B.: In this specification, we limit the built-ins to
numbers, but actual implementations may (and hopefully will) include
other built-in sorts
(e.g., String, Boolean, List, etc.,
...). As well, interpeted operations on literals such as arithmetic,
string manipulation, etc., and related issues will be
discussed later.]
All other non-built-in sorts can then be stored from index 3 and upwards
in the order they come. In order to allow for more built-in sorts in
future versions (say, List, Nil,
String, Boolean, True, False,
etc., ...) rather than referring to index 3 explicitly, we
shall define a int constant called MIN_INDEX
denoting the least index available for non-built-in sorts in the array
TAXONOMY. Thus, in this specification, MIN_INDEX =
3.
The class Sort has the following fields:
- a String name representing the name of the sort;
- an int index representing the sort's index in the array TAXONOMY;
- a BitCode
code representing the bit vector code encoding the sort (a subclass of BitSet
rather than BitSet itself for reasons explained below).
We shall ensure that the built-in sorts are encoded once and for all as follows:
- the code of sort {} has all its bits set to false;
- the code of sort @ has all its bits set to true;
- the code of sort Integer has its
1st bit set to true, and all others set to false;
- the code of sort FloatingPointNumber has its 2nd
bit set to true, and all others set to false;
- the code of sort Number has its 1st,
2nd, and 3rd bits set to true, and all others set to false.
Before the complete sort encoding may be performed, a
Sort object is created for each sort in the order is appears
and stored in the TAXONOMY array at the next available index
in TAXONOMY, with its name and this index. As it is created,
a Sort object of index n greater or equal to
MIN_INDEX, is initialized to have the nth
bit of its code set to true.
When all Sort objects are thus stored in the array
TAXONOMY, a transitive closure procedure on all sorts of index
greater or equal to MIN_INDEX must be performed to reflect the
declared ordering for non-built-in sorts. In order to do so,
whenever TAXONOMY[i].name has been declared to be a subsort
of TAXONOMY[j].name, then the jth bit
of TAXONOMY[i].code must be set to true.
After that, a standard transitive closure is performed on the codes of
all elements of index higher than MIN_INDEX to obtain the
final set of codes for all the sorts. Using Warshall's
algorithm (see also here), this gives:
int n = TAXONOMY.size();
for (int k = MIN_INDEX; k < n; k++)
for (int i = MIN_INDEX; i < n; i++)
for (int j = MIN_INDEX; j < n; j++)
if (!TAXONOMY[i].code.get(j)))
TAXONOMY[i].code.set(j,TAXONOMY[i].code.get(k) && TAXONOMY[k].code.get(j));
N.B.: The java.util.BitSet
ensures that the size of bit sets is increased as needed. However,
extending the size of a BitSet is done so the new bits added
to it are higher-end bits and are initialized to
false. This may be fine for all the sort codes except for that
of @ since it must always have all its bits set to
true. It is also problematic when using Boolean negation
on a code as the upper bits will then be flipped
to true—thus giving way to nonsensical interpretations.
Therefore, it is necessary to limit the size of relevant bits in
a BitCode so as to manipulate only the lower bits that make
sense. In particular, we must ensure that @ has all and only those
bits within this size set to true. This size is the number of
defined sorts. Hence, we must set the code of @ only after all
other codes have been computed so that we know the maximum BitSet
size (i.e., TAXONOMY.size()) used to allow
setting @'s code correctly to BitCode.allSet(TAXONOMY.size()).
We must be careful to make Boolean operations (especially the not
operation) on codes operate only on the bits of index lower
than TAXONOMY.size(). This is another reason why we need to
define
the BitCode
subclass
of java.util.BitSet
to provide for dyadic and, or, and not
Boolean methods that are aware of the size of the codes to
maintain always all others to false .
For large or sparse taxonomies, this encoding method can be further
optimized (see the following section on modulated
encoding).
This section discusses implementation issues for sort encodings is large
taxonomies using a technique
of code
modulation taking advantage of sparse or barely disconnected sort
hierarchies which happen frequently in practice.
Modulated encoding is technique taking advantage of the topology
of the taxonomy. The basic idea of modulated encoding that a taxonomy
can be partitioned into modules of tightly related elements such that
only a sparse number of "is-a" links cross over from one module to
another. Then, the set of modules can be viewed itself as a
"meta"-taxonomy of module elements that can be encoded using the method
above. The taxonomy inside each module can then be encoded independently
of the others modules. This ends up in providing the sorts with a
lexicographic encoding. This reduces the size of codes at the expense of
slightly less efficient boolean lattice operations. Comparing two
elements first compares the codes of their modules, then if that result
in consistent module, proceeding with the comparison in the resulting
module. The efficiency in code size and code operations depends of the
"quality" of the modulation of the original ordering. This method is
worth using for very large sparsely connected taxonomies (which are very
common). This modulation can be reiterated to have modules of
modules, etc., ...
[To be completed later...]
In essence, once the set of sorts has been encoded as explained
above, it is possible to proceed to interpreting sort expressions
using bit vector operations on the underlying sort codes. However,
there are some practical issues to consider.
As they are being declared, sorts are parsed as identifiers which
are then encapsulated as Sort objects stored in the
TAXONOMY array at their first appearance. Each sort name
must correspond to a unique Sort object. Therefore, a later
appearance of a sort name must be able to retrieve its corresponding
Sort object stored in TAXONOMY. Rather than
sweeping this array each time with a linear search, it is more
efficient to use a hash table called NAMEDSORTS associating
String sort names to the Sort objects they correspond
to. Thus, at the same time a Sort object is stored in the
array TAXONOMY, it is also recorded in the
hashtable NAMEDSORTS as the value of the its unique
(i.e., interned) String name. It is important to have
the equals method and the hash code of a Sort object
depend only on the sort's name, and neither on its
index nor on its code fields as these are both
subject to change during the encoding and reordering of the array
TAXONOMY (see below). For this reason, Sort.name will
always be set to its intern() unique version so that they may
compared with == rather than equals.
It is possible (if only as a mistake) that some is-a
declarations introduce cycles in the ordering. Detecting such cycles is
simple once the sort encoding has been performed by transitive
closure. This is explained next.
- Observe that the existence of a cycle in the ordering will imply
that all the sorts in the cycle have equal code after the transitive
closure encoding is performed. This means that all these sorts must
denote the same set, and should be one and only sort. Such a cycle
could either be flagged as an error, or all the sort symbols in this
cycle should be taken as synonymous (in effect taking the quotient
set with respect to the equivalence relation on sorts such cycles
create). In this version, the first option is taken, and such a cycle
is flagged as an error. This is because one presumably would not
intentionally declare a cyclical sort taxonomy.
- In order to detect potential cycles, we proceed
as follows. Once encoded by transitive closure on the sort ordering,
the part of the array TAXONOMY containing non built-in sorts
(i.e., starting at index MIN_INDEX) is sorted using a
topological ordering whereby a Sort object s is said
to precede another Sort object t iff, in this
order:
- s.code is strictly lower than t.code;
or
- s.code == t.code and
s.index < t.index;
or
- t.code is not lower than t.code
(i.e., s and t are unrelated)
and
- s.code.cardinality() == t.code.cardinality() and
s.index < t.index
or
- s.code.cardinality() < t.code.cardinality().
A code c1 is deemed lower than a code
c2 whenever (c1 AND c2) ==
c1. The expression c.cardinality() for a code
c denotes this code's cardinality (i.e., its
number of bits set to true).
Sorting the array TAXONOMY (with,
say, QuickSort)
using the above "precedes" relation will always result in an array where
distinct sorts of equal codes (if any) end up being
contiguous. Therefore, since sorts of equal code form a cycle, cycles
can be detected and reported in one single sweep. This reordering will
also rearrange the sorts so as to reflect the sort taxonomy where lesser
sorts will always be at a lower index than their parents.
Note that, after the above topological reordering of the
array TAXONOMY, a sort object's index field will no
longer correspond to its position in the TAXONOMY array. This is
annoying if we wish to have direct access to a sort by its position in this
array. Resetting each sort index field to match its new position rank in
the TAXONOMY array would not be satisfying either since it would no
longer allow identifying the position of which subsort corresponds to
which true bit in its code (see for example, when computing
the descendants of a sort, or its
height in the taxonomy). For this reason, we shall
maintain a new int[] array called RANK such
that RANK[i] contains the position in TAXONOMY of the sort
of index i. In other words, RANK[s.index] is the position
of sort s
in TAXONOMY; i.e., TAXONOMY[RANK[s.index]] = s.
This RANK array is initialized after sort encoding and reordering
while sweeping the TAXONOMY array checking for cycles.
Sort codes allow fast Boolean lattice operations on the sort taxonomy
so that intersection, union, and complement are optimized. Note
however that such operations may result in a code for which there is no
explicit declared sort. This is fine as long as it is not necessary to
access explicitly declared sorts that appear in the sort ordering.
However, when this is eventually necessary—if only for printing
results—we will need to retrieve which explicit sorts a code
corresponds to. When this code corresponds to an actual sort in
TAXONOMY, then it is returned. Otherwise (i.e., when
there is no explicit sort with such a code), the answer for this code is
the disjunctive set of sorts possessing a code immediately lower than the
given code. In other words, the sort of a non-existing sort's code is
retrieved as the sort expression that is the union of its maximal lower
bounds.
In order to allow for such retrieval, we shall use a hash table
called CODECACHE, as a cache associating sort codes to the set
of all Sort objects that are its maximal lower bounds of
explicitly declared sorts existing in the encoded taxonomy. If not
stored already, such a set is computed from the TAXONOMY
array.
To compute such a set of maximal lower bounds of a given code, we sweep
the TAXONOMY array upwards starting from index
MIN_INDEX. As we proceed upwards, we collect all lower bounds of
the given code in a set, pruning this set anytime a code higher than one of
its elements is found, which is then added to the computed set. Note that
there is no need to go higher than the first sort whose code is the given
code or that of a super sort of the given code (because the order of the
sorted array ensures that all sorts of higher index are greater or
unrelated). Thus, the final set of Sort objects we end up with is
that of the code's maximal lower bounds, which we can then store in the
hash table cache CODECACHE as an image of the given code for
future potential retrieval of this code's maximal lower bounds without
having to recompute it.
One important note is worth making regarding how to represent sets of sorts
stored in CODECACHE as potentially non-unique sorts associated to
a bit code. Such sets of sorts are also needed for printing out results
when decoding bit codes. It is possible of course to use aggregative
objects of Java type Sort[], ArrayList<Sort>, or
HashSet<Sort>. However, even though any of those would work
for our purpose, they would be unnecessarily wasteful in terms of memory
space and processing time. For example, they would lead to awkward set
operations, which are needed in many of
the taxonomy pragmas used to explore a
sort taxonomy (e.g., computing a
sort's children/parents, or
its siblings, mates,
similars, etc.,...).
So, rather than using such space-bulky and time-inefficient aggregate Java
types, it is simpler and more efficient to use a BitCode. Indeed,
all we need is to represent a set of sorts in canonical form; i.e.,
a maximal disjunctive set of incomparable sorts of the
form {s1; ...; sn}. Hence, a bit code will
precisely represent such a disjunctive sort when
its jth bit is set to true iff
j == si.index, for all n
= 1,...,n. Retrieving any sort member si of
such a set is then given simply by TAXONOMY[RANK[j]],
where j is the bit set to true
for si. If such a bit code c is such
that c.cardinality() == 1 then it corresponds the unique defined
sort given by TAXONOMY[RANK[c.nextSetBit(0)]].
Sort expressions as defined above are essentially
Boolean expressions. Evaluating such expressions can be quite efficient once
the sort taxonomy has been encoded as we can rely on the underlying Boolean
lattice of BitCodes. However, there are a few subtle issues to
consider in order to keep these operations as efficient as they should
be. Following is an overview of a setup for evaluation of sort expressions
that will have been parsed as expressions involving sort names.
-
Evaluation of a sort expression is done in the context of an evaluation
context. Such a context may be defined as a class called,
say, Context. An instance of such a Context will serve
as a environment of evaluation for expressions. So we will assume such
an instance object called, say, CONTEXT that consists of at
least the three structures defined above:
- the array of sort codes TAXONOMY;
- the table of sorts name NAMEDSORTS;
- the cached codes' table CODECACHE;
Such a context will also contain other components necessary for
accessing global information pertaining to a sort taxonomy and methods
that may need this information. It may be construed as a
compilation/evaluation environment for building, accessing, and using
the shared taxonomic information (such as error manager, display manager,
evaluation stack, etc., ...).
Thus, in particular, such a CONTEXT structure must be
accessible to the parser for declaring and encoding sorts as described
above.
- Concerning the evaluation of sort expressions, the idea is to rely
on the sorts' bit vector codes, once the sort taxonomy has been
encoded. Therefore, a possible evaluation scheme could proceed as a
simple Boolean calculator interpreting sort expressions
in CONTEXT's environment as they are being parsed once the
sorts have been encoded, checked for cycles, and reordered to ease
code retrieval, and the tables NAMEDSORTS
and CODECACHE have been initialized with the declared
sorts.
Indeed, such an interpreter would only require Boolean operations on
codes. Thus, a sort expression could be evaluated from the result of
evaluating its subexpressions. The leaves of such expressions being
sort names, this scheme would have to use the corresponding
BitCodes retrieved from NAMEDSORTS and produce a
BitCode result for each nested sort expressions. Eventually,
the outermost expression would result in a BitCode, which
could be retrieved/cached from the CODECACHE table.
Hovever, such an interpretation scheme necessitates that
non-destructive Boolean operations be used on the bit
codes. Because, if not, this would destroy the sort codes stored
in TAXONOMY. So, it would be a wasteful scheme as it would
require copying each argument a Boolean operation on BitCodes
before performing the operation. Hence, the whole intent of optimal
use of bit vectors for fast in-place Boolean operations would become
moot—not to mention the space waste that such copying would generate.
- Therefore, we need to minimize unnecessary copying while evaluating
a Boolean sort expression on bit codes while ensuring no destructive
side effects on the sort codes in the encoded taxonomy. A
simple compilation scheme can thus be devised in
a CONTEXT's environment. The idea is that, instead on the
naive intrepretation scheme described above that would evaluate each
subexpression as it is read, performing this evaluation can be delayed
until the outermost expression has been parsed and
constructed.
Such a scheme would then compile this outermost expression and its
subexpressions in postfix form so as to enable a stack-based
evaluation. The point is to use a reusable copy of a sort code
whenever possible. This may be done as described next.
- Implement a locking mechanism on a BitCode indicating
whether the bit code may be modified in-place. In particular,
the codes of the encoded taxonomy should all be locked.
- Implement static Boolean methods on bit codes for and,
or, and not (which are all we need) in such a way
that whenever at least one of their arguments is unlocked, reuse
this argument in place and push it back on the evaluation stack
as the operation's result. Otherwise, copy only one of the
arguments before effectuating the operation in-place and reuse
the same unlocked bit code for the result as well. Therefore,
the result of all operations pushed on the evaluation stack will
always be unlocked and thus reusable in place.
- After that, all other operations on non-leaf arguments in the
expression will be garanteed to need no argument copying since
one of their arguments will always be reusable.
With such a scheme, evaluating a postfix expression pushed on the
evaluation stack would be such that the minimal amount of copying
would be ensured. The final result of evaluating the expression so
compiled into a postfix expression stack ends up being the last value
pushed on the evaluation stack when reaching the bottom of the postfix
expression stack, and this value is always unlocked. So it may then be
locked if needed and returned.
This scheme results in just one code copy per dyadic operation having
all their arguments be declared sort symbols. For example, only one
copying is required for evaluatiing the disjunctive
sort {s1; ...; sn} for any n
(that of s1). The sort
expression (s1 & ... & sn)\(t1 |
... | tm), where the si's
and tj's are sort symbols, will require only 2
copying for any n and m
(those of s1 and t1).
Before being interpreted, a ψ-term is represented as a data
structure, which is described next. Such a structure is built by parsing
its syntax, ensuring that all such a structure's properties are kept
consistent with its definition prior to being interpreted.
A ψ-term structure is built out of the syntactic expression
described above. Such a syntactic structure is represented as
an instance of the class PsiTerm. The basic structure of this class
consists of:
- a field tag of type Tag;
- a field sort of type SortExpression;
- two tables:
- a field positions of type ArrayList containing
ψ-terms for its positional subterms;
- a field features of type HashMap associating features
of type Feature to ψ-terms for its featured subterms.
A tag is essentially a logical variable. In other words, it may be unbound, or
bound to another tag.
Its essential structure consists of:
- a field name of type String, which identifies the tag's name;
- a field ref of type Tag, which identifies a tag that it is
bound to. As in Logic
Programming, we will follow the convention that an unbound tag is
self-referential.
- A tag also has a field term of type PsiTerm which,
when not null, refers to the ψ-term of which it is
the tag field. A tag with a null term field
is called a "free" tag, which means that it is not the tag of
any ψ-term (although it may be bound to one that is).
Thus, a tag uses a Tag deref() method that follows the chain
of ref links until reaching the ultimate unbound (i.e.,
self-referential) tag, which it returns. Therefore, anytime accessing a
tag, it must be dereferenced to get to its actual identity.
A feature is just a name—and could thus be represented as
a String. However, for reasons that will become apparent when we
introduce typing constraints for features, we represent them as instances of the
class Feature, which (for now) has only one field name, of
type String.
Building a ψ-term structure by a parser from its written syntax must
essentially ensure that (1) features and positions are functional
(i.e., that a given feature or position appears at most once as
argument of the ψ-term); and, (2) tagging is consistent
(i.e., that a given tag refers to the same ψ-term
(sub-)structure no matter where it occurs). Since this must be done
prior to sort interpretation, we proceed as follows.
If a feature (or position) that has already been built as a ψ-term's
argument (say, as f ⇒ t) occurs again as an argument to
the same ψ-term (say, as f ⇒ t'), then the structure
of t' is merged into that of t by:
-
setting t'.tag.ref to t.tag (i.e.,
binding t'.tag to t.tag);
-
setting t.tag.sort to t.sort & t'.sort (i.e.,
the uninterpreted conjunction of the two sort expressions—as a
simplification, if one of the sort expressions is @, there is
no need to generate a new sort expression; i.e., rather than
generating an expression of the form t & @ or t & @;
we leave it as just t);
-
adding all the subterms of t' to those of t, merging
multiply occurring features or positions as done for this feature
or position.
The same subterm merging procedure is applied for multiply occurring
tags. In that case, a subterm is first built using a new tag, then this
newly built subterm is merged with the reoccurring tag's term. This
ensures consistent term sharing, including circular coreferences.
For example, parsing the following ψ-term syntax:
#P : person
( id => @
( first => "John"
)
, id => name
( last => #S
, first => string
)
, spouse => married_person
( address => #A : location
)
, spouse => @
( id => name
( first => "Jane"
, last => #S : "Doe"
)
, id => name
( first => string
)
, spouse => #P : married_person
( address => #A
)
)
)
will yield the following uninterpreted ψ-term structure:
#P : person & married_person
( id => name
( first => "John" & string
, last => #S : "Doe"
)
, address => #A : location
, spouse => married_person
( id => name & name
( first => "Jane" & string
, last => #S
)
, spouse => #P
, address => #A
)
)
Note that, because of term merging and the fact that a ψ-term is
uniquely identified by its tag, the "true" identity of a ψ-term must
be retrieved through its dereferenced tag. This is especially
relevant when accessing subterms of a ψ-term. Thus, just as for
tags, a PsiTerm deref() method is provided for this
purpose. For a ψ-term t, invoking t.deref() is
equivalent to invoking t.tag().deref().term(). Therefore,
anytime accessing the actual structure of a ψ-term, it must first be
dereferenced to get to its actual identity.
Once built, an uninterpreted ψ-term structure may be interpreted by
evaluating the sort expressions at each node. In order to do that, the
sort taxonomy must first be encoded. If the intepretation process
yields {} for any node, then the whole ψ-term's value
is {}.
Once interpreted, the pretty-printer must be adapted to decode the
sort expression's values at each node resulting from their evaluation.
[To be completed later...]
This is a facility only meant as a pragmatic convenience for a user to avoid
having to type in a large ψ-term several times if needed. A name
definition associated to a ψ-term has global scope. However, each
occurrence of a defined name in an expression stands for a new ψ-term:
the one it defines up to renaming of its tag variables. In particular
several occurrence of the same name will not share variables before
interpretation. (After interpretation, some of these tag variables may get
unified.) Indeed, the scope of defined names is global while tag variables
appearing in an expression are local to that expression. In order to avoid
infinite recursive expansion issues, such a defined name may not be allowed
to appear inside an expression definition before being itself
defined—including in its own definition expression.
[To be completed later...]
Next, we elaborate on operations meant as facilities for a user to explore
or set parameters of an OSF sort definition and evaluation
environment. These are dubbed pragmas as they are, strictly speaking,
not componets of an OSF language, but pragmatic tools related to an OSF
envirironment. They take the form of extra-linguistic operators whose
effect is to set parameters of, or return information about, the state an
OSF environment. Such a pragma is a built-in construct of the
form %pragma where "pragma" is the name of the pragma,
possibly followed by arguments.
Taxonomy pragmas
Taxonomy pragmas are operators meant to explore a declared sort
taxonomy. Useful pragmas for exploring taxonomic properties are listed
below. This is neither a complete list, nor are all of these guaranteed to
be provided due to non-trivial computational complexity required by some
(e.g.,
%width).
- %size—This prints the number of defined sorts in the
sort taxonomy.
- %sorts—This prints the array of all defined sorts and
their codes.
- %height s—This prints the size of the longest
is-a sort chain from {} up to to s. Thus, the
height of the complete taxonomy is given by %height @, or
simply %height (i.e., without argument).
- %depth s—This prints the size of the shortest
is-a sort chain from @ down to s. Thus, the depth
of the complete taxonomy is given by %depth {}, or
simply %depth (i.e., without
argument). Clearly, %depth {} and %height @ must be
equal, and equal to %depth s + %height s for any
sort s.
- %width s—This prints the size of the widest
is-a sort antichain containing s; that is, the
size of the largest set of mutually unrelated sorts having s as
one of its elements. (Note that this is not necessarily the size of the
set computed by %unrelateds s; but it
will always be greater or equal to the size of %unrelateds s.)
The width of the complete taxonomy is given by %width
(i.e., without argument).
- %children s—This prints the set of maximal strict
lower bounds of s (i.e., the set of immediate
subsorts of s), or {} if there are none.
- %parents s—This prints the minimal strict upper bounds
of s (i.e., the set of immediate supersorts
of s), or @ if there are none.
- %minimals—This is equivalent to %parents {}
or %heirs @.
- %maximals—This is equivalent to %children @
or %founders {}.
- %descendants s—This prints the set of strict lower bounds
of s, or {} if there are none.
- %ancestors s—This prints the set of strict upper
bounds of s, or @ if there are none.
- %heirs s—This prints the set of miminal non-{}
strict lower bounds of s (i.e., the subset of
its descendants immediately above {}), or {} if there are none.
- %founders s—This prints the set of maximal non-@
strict upper bounds of s (i.e., the subset of
its ancestors immediately below @), or @ if there are none.
- %isa s t—This returns true if s denotes
a subsort of t, and false otherwise.
- %related s t—This returns true if
either s is a subsort of t or t is a
subsort of s, and false otherwise.
- %unrelated s t—This returns true if
neither s is a subsort of t nor t is a
subsort of s, and false otherwise.
- %unrelateds s—This prints the set
of maximal sorts that are incomparable to s (i.e., the
maximal antichain containing s).
- %sibling s t—This returns true if s
and t have the same parents, and false
otherwise.
- %siblings s—This prints the set of maximal sorts that have
the same parents as s, including s itself.
- %mate s t—This returns true if s
and t have the same children, and false
otherwise.
- %mates s—This prints the set of maximal sorts that have the
same children as s, including s itself.
- %similar s t—This returns true if s
and t have both the same parents and children, and false
otherwise. Note that this does not mean that s and t
are equal sorts.
- %similars s—This prints the set of maximal sorts that have
both the same parents and children as s, including s
itself.
The children of a sort are its maximal lower bounds—except for bottom,
which is always childless: its children are just itself ({}).
In order to compute the children of a sort, given its code c,
we need to sweep the TAXONOMY array starting from {}
up to, but not past, the first sort whose code is equal to or greater
than c. As we do so, we collect any defined sort whose code is
a strict lower bound of c, removing from the collected set any
sort subsumed by a previously so-collected sort. The reason why we need
not proceed past the first sort whose code is equal to or greater
than c is thanks to the post-encoding reordering of
the TAXONOMY array, which guarantees that any sort of higher
rank will have a code that is either incomparable to or contains the
given code.
The children of a sort expression are defined as follows.
- If the expression is just a sort symbol, there is no need to
evaluate it and its children are computed as explained above.
- Otherwise, the expression is evaluated and if its value is the code of
a single sort then this is again treated as the previous
case.
- Otherwise, the set of maximal lower bounds of the expression's
value code is computed and displayed.
The parents of a sort are its minimal upper bounds—except for top,
which is always parentless: its parents are just itself (@).
Computing the parents of a sort is done similarly as for the children,
only starting from @ and proceeding downwards down to, but not
past, the first sort whose code is equal to or less than c. As
we do so, we collect any defined sort whose code is a strict upper bound
of c, removing from the collected set any sort subsuming a
previously so-collected sort.
The parents of a sort expression are defined similarly as for the
children of a sort expression, only dually.
- If the expression is just a sort symbol, there is no need to
evaluate it and its parents are computed as explained above.
- Otherwise, the expression is evaluated and if its value is the code of
a single sort then this is again treated as the previous
case.
- Otherwise, the set of minimal upper bounds of the expression's
value code is computed and displayed.
[To be completed later...]
[To be completed later...]
[To be completed later...]
The heirs (resp., founders) of a sort s are easily
obtained from the minimals (resp., maximals), which are given
by %parents {} (resp. %children @).
For %heirs s, this is done by keeping only those sorts in the set
of minimals whose index is set to true in s.code.
For %founders s, this is done by keeping only those sorts in the
set of maximals whose code has the bit corresponding to s.index
set to true.
Computing the set of descendants of a sort in a fully encoded taxonomy is quite
easy since they correspond to the true bits in its code, except for
that sort itself. Therefore, it is given by the following formula:
- descendants(s) = s.code.copy().set(s.index,false).
Contrary to the set of descendants, computing ancestors of a sort s
in a fully encoded taxonomy is not so obvious. Indeed, the method
described above relies on the fact
that s.code was obtained by transitive closure. So the positions of
the true bits in s.code give the ranks of all and only the
descendants of s.code.
For ancestors, it is not possible to proceed similarly by relying on
the false bits of s.code. Indeed, these bits only say
that the sorts positioned at these ranks are not subsorts
of s. In other words, they would only give a superset of the
set of the ancestors of s.
However, thanks to the post-encoding reordering done on TAXONOMY,
we know that all the ancestors of s, if any, must be positioned at
a higher rank than that of s in TAXONOMY. Therefore, it
is sufficient to sweep all sorts at a rank position strictly
above s's rank position in TAXONOMY and collect any sort
that has its bit corresponding to s's index set
to true. This is done as follows:
- BitCode ancestors = new BitCode();
- int index = s.index;
- int rank = RANK[index]+1;
- for (i = TAXONOMY.size(); i-->rank;)
if (TAXONOMY[i].code.get(index));
ancestors.add(TAXONOMY[i].index);
Then, upon termination, ancestors(s) = ancestors.
Computing the height of a sort in a fully encoded taxonomy can by done by dynamic
programming as follows.
- Given a sort s, height(s) is defined as the
length of the longest is-a chain from s
to {}.
- Therefore, it is given by the following recursive formula:
- if s == {} then height(s) = 0;
- else height(s) = 1 + max
{ height(TAXONOMY[RANK[i]]) | s.codei == true & i =/= s.index }.
Then, the height of the whole taxonomy is given by height(@).
In order to make this computation more efficient, we can prevent
multiple recomputation of the height of each sort in the above recursive
formula by setting a new field in the class Sort
called height to a sort's height as soon as it is computed.
Computing the width of a poset is known to be
a difficult
problem.
[To be completed later...]
We also provide for other miscellaneous useful pragmas.
- %include file—This includes the specified file.
- %encode—This triggers the encoding of
the current sort taxonomy.
- %automatic—This toggles on/off
automatic sort encoding at the first expression evaluation without
prompting the user. The default is to prompt the user.
- %last—This prints the last sort expression
that was evaluated and its maximal lower bound sort(s) in the current
sort taxonomy.
- %mute—This toggles on/off printing evaluation results
on the default output. If this is on, it makes sort encoding
automatic (whether %automatic is on or off). The default
is to print results of evaluations.
- %timing—This toggles on/off timing of evaluation of
expressions. The default is off.
- %extend—After this pragma has been invoked,
new sort declarations and sort symbols appearing in name definitions may
then be added those previously recorded, and then the taxonomy may be
reencoded.
- %clear—This erases all declared sorts (not the
built-in ones), and all name definitions. This is useful if one want to
include a new file containing new declarations and definitions unrelated
to the previous ones and restart from scratch.
There could certainly be other useful pragmas that may be added to this basic set.
Dealing with literal constants of built-ins sorts (such as numbers,
strings, etc.), requires special care.
In order to allow elements of large or infinite sets of values such as
integers, floats, strings, etc, we need to provide a means to
account for such values without the need of treating each element as
a bona fide sort. Indeed, this would be unfeasible to encode
elements of infinite sets (e.g., integers), or grossly
inefficient for large finite sets as it would require a large number of
codes of large size, each bearing only one single true bit.
Another issue regarding elements is that they may be aggregated
using a monoid composition law—i.e., a dyadic
associative operation with a neutral element. For example, for number
elements: (+,0), (×,1),
(min,-∞), (max,+∞), etc., ...
To deal with both issues, a simple data structure representing an
element sort can be used as a subclass of Sort bearing, in
addition to the sort's bit code, a value and a monoidal composition
law. The former indicates the sort of the element, and the latter how to
compose two elements of compatible sorts if ever unified. By default,
this composition law is equality—i.e., whenever the elements
have a value, it must be the same. If another law is specified and both
elements have compatible sorts and each has a value, then the two
elements are composed using it and the sorts intersected as normal sort
conjunction.
[To be completed later...]
[To be completed later...]
A polymorphic sort is a sort that takes another sort as
argument—e.g., set(α),
list(α), etc., ...—where α is
a sort variable. Such a sort may have its sort variable be
instantiated by any existing
sort—e.g., set(person), list(set(integer)), etc.,
...
[To be completed later...]
A constraint attached to a the sort in the taxonomy specifying how
features are sorted and equations among feature path is called an
OSF theory.
It first puts ψ-terms into a so-called dissolved form. Dissolving
ψ-term transforms its graph structure into a set of labelled triples
representing its edges. This is pretty much similar to the way RDF triples
are obtained from their XML representation syntax that refer to other XML
structures using tag attributes. In fact, this can be a standard internal
format for ψ-terms as well as closely related graph-data formats such as
semi-structured
data representations,
LinkedData, JSON,
JSON-LD,
etc., ...
[To be completed later...]
This addresses a method of compiling dissolved ψ-term (especially
for unification) using
a WAM-style
abstract-machine.
A basic method is described in
this technical
report. An implementation of the method done
in LIFE is
described here. This
method can be extended
to compile OSF
theories.
However, these methods have to be spelled out for an efficient Java
implementation.
[To be completed later...]
[To be completed later...]
[To be completed later...]