Introduction
TermL (pronounced Term-EL or Ter-MEL) is E's language and system for expressing and manipulating term-trees. TermL stands for Term Manipulation Language. A term-tree is a tree of terms. A term (a node of a term-tree) consists of a label (called a functor) and a sequence of child terms (called arguments). The leaves of a term-tree are simply terms with no children.
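The definition above can be sketched as a small data structure. This is only an illustration, not E's implementation: the class name `Term`, its fields, and the choice to model a functor as a plain string are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Term:
    """One node of a term-tree: a functor (label) plus child terms."""
    functor: str                   # the label; a plain string is a simplifying assumption
    args: Tuple["Term", ...] = ()  # the arguments; empty for a leaf

    def is_leaf(self) -> bool:
        """A leaf is simply a term with no children."""
        return len(self.args) == 0

    def size(self) -> int:
        """Count every term in the tree rooted at this node."""
        return 1 + sum(child.size() for child in self.args)


# A root term labeled "add" with two leaf children.
tree = Term("add", (Term("2"), Term("3")))
```

Here `tree` has three terms in total, and both of its arguments are leaves. Making the class immutable is a design choice for the sketch only.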
E uses term-trees (as found in Prolog and related logic programming languages) for many of the same purposes that others use S-Expressions or XML. All are generic means for representing trees of symbols, and are useful for representing a great variety of kinds of data. Trees of symbols lie midway between sequences of symbols and graphs of symbols in expressiveness.
A BNF-parser converts a sequence of symbols into a tree of symbols called an Abstract Syntax Tree, or AST. E plans to use ANTLR as its parser generator, and the ASTs produced by ANTLR can be faithfully converted to and from TermL term-trees. E thereby uses TermL as its universal AST notation.
Note: ANTLR's documentation may cause confusion here. It says that its ASTs are logically S-Expressions, and ANTLR's notation for its ASTs looks like S-Expressions. However, its ASTs are instead logically equivalent to mutable term-trees, and its S-Expression-like notation for these term-trees just adds to the confusion. See the Differences from S-Expressions table below.
For each BNF grammar for parsing a sequential input language into a tree, one can derive a tree-BNF, or a grammar over trees, that describes exactly the possible trees that could result from parsing according to the original BNF. As with XML, such tree-grammars are known as schemas, and trees accepted by a given schema are valid with respect to that schema. Each schema reflects the infoset of a particular input language -- it shows what needs to be preserved after abstracting away from the sequential surface syntax. ANTLR supports tree-BNF directly. TermL does not currently make use of that support, but expects to in the future.
Again, as with S-Expressions or XML, TermL has a standard universal textual syntax -- universal in the sense that any TermL term-tree can be written in that notation. As with the others, this notation is designed for both human and machine readability.
Unlike the other systems, but like other term-tree notations, TermL is designed to be familiar and natural for programmers coming to E from the C/Java/Python syntactic tradition. We expect to also define a universal binary notation for TermL that encodes exactly the same information, but in an encoding optimized only for automated use.
Syntax
This specification extends the specifications in Common Syntactic Elements.
Character classes
The non-ASCII cases are not yet tested in the current TermL implementation.
Token Types
The ident production should accept all the keywords and identifiers allowed by most languages in the C or Algol tradition. This includes E -- identifiers in the E language are just like Java identifiers except that they may not include a '$' (dollar sign). (E excludes the '$' from identifiers so that the use of '$' in fully qualified names will be unambiguous.)
The special production allows keyword-like segments such as .bag. that cannot conflict with Java identifiers, E identifiers, or XML NCNames, since none of these allow an initial '.'. By convention, the special names also end with a '.'. The special names are reserved for use by TermL and TermL-related specifications. (Much as names beginning with "xml" are reserved for use by XML and XML-related specifications.)
In the uri production, any '\\' characters read in are converted to '/', and any '|' characters are converted to ':', so the TermL infoset does not include these as possible characters in a Tag. Since the uri production by convention will often represent IETF-URIs, this sequence of uric characters may often contain sequences such as '%' digit16 digit16 in order to quote a non-uric character. A TermL lexer or parser must not interpret such quotations; interpretation is left to the application, if it wishes to do so. The TermL infoset includes such sequences only in their uninterpreted form.
Term Tree Kernel Grammar & Infoset
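As an illustration of how the infoset ends up without these characters in a Tag, the uri character conversion described under Token Types can be sketched as follows. The function name `normalize_uri_chars` is hypothetical, and the sketch assumes that '\\' in the text denotes a single backslash character. It deliberately leaves '%' digit16 digit16 sequences untouched, since interpreting them is left to the application.

```python
def normalize_uri_chars(raw: str) -> str:
    """Apply the two character conversions described for the uri production.

    Assumption: the backslash is the character written '\\' in the prose.
    Percent-quoted sequences such as %20 are passed through uninterpreted.
    """
    return raw.replace("\\", "/").replace("|", ":")


# A made-up input, used purely for illustration.
converted = normalize_uri_chars("c|\\temp\\a%20b")
```

For the made-up input above, `converted` is `c:/temp/a%20b`: both conversions are applied and the `%20` sequence is preserved as written.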
Each functor has a kind, which is a tag. Tags are an enumeration of kinds -- each tag is its own kind. The four forms of literal data are each represented by their own kind -- the tags shown in the following table:
For a given application of TermL, such as representing ASTs from parsing a particular language, there will typically be an infinite number of possible functors, but a finite number of possible functor kinds. Each functor kind will typically correspond to an AST node type. A TermL schema (to be covered separately) should specify constraints in terms of functor kinds rather than particular functors.
In order to preserve one-to-one correspondence with ANTLR ASTs, and because it makes the TermL infoset simpler, the TermL spec allows non-tag functors with children. However, this is currently discouraged until a motivating use is found. Instead, terms with a non-empty list of children should generally have tag functors, and non-tag functors should generally be used only as leaves.
Syntactic Shorthands
The first shorthand -- being able to omit an empty argument list -- is essential for human readability, and is supported by all term-tree languages. It allows us to write add(2, 3) rather than add(2(), 3()). The rest support some standard conventions for representing concepts from more complex infosets using the simple concepts of TermL, as shown by the following example embeddings.
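Separately from those embeddings, the first shorthand can be illustrated with a minimal sketch of two printers, one that always writes the argument list and one that omits it when empty. This is not the TermL printer: the names `Term`, `to_full`, and `to_short` are hypothetical, and functors are modeled as plain strings for simplicity.

```python
from typing import NamedTuple, Tuple


class Term(NamedTuple):
    """A term: a functor (modeled here as a string) and its arguments."""
    functor: str
    args: Tuple["Term", ...] = ()


def to_full(term: Term) -> str:
    """Write every term with an explicit argument list, even when empty."""
    inner = ", ".join(to_full(arg) for arg in term.args)
    return f"{term.functor}({inner})"


def to_short(term: Term) -> str:
    """Apply the first shorthand: omit an empty argument list."""
    if not term.args:
        return term.functor
    inner = ", ".join(to_short(arg) for arg in term.args)
    return f"{term.functor}({inner})"


tree = Term("add", (Term("2"), Term("3")))
full_text = to_full(tree)    # "add(2(), 3())"
short_text = to_short(tree)  # "add(2, 3)"
```

Both strings describe the same term-tree; only the surface notation differs.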
Unless stated otherwise, all text on this page which is either unattributed or by Mark S. Miller is hereby placed in the public domain.