/ data / terml

TermL Specification
E's Term-Tree Language

Introduction

TermL (pronounced Term-EL or Ter-MEL) is E's language and system for expressing and manipulating term-trees. TermL stands for Term Manipulation Language. A term-tree is a tree of terms. A term (a node of a term tree) consists of a label (called a functor) and a sequence of child terms (called arguments). The leaves of a term-tree are simply terms with no children.

Some languages for symbol trees
XML	<add> <var>x</var> <num>3</num> </add>
JSON	{"add" : { "var" : "x", "num" : 3}}
S-Expressions	(add (var "x") (num 3))
TermL	add(var("x"), num(3))

E uses term-trees (as found in Prolog and related logic programming languages) for many of the same purposes that others use S-Expressions or XML. All are generic means for representing trees of symbols, and are useful for representing a great variety of kinds of data. Trees of symbols lie midway between sequences of symbols and graphs of symbols in expressiveness. A BNF-parser converts a sequence of symbols into a tree of symbols called an Abstract Syntax Tree, or AST. E plans to use ANTLR as its parser generator, and the ASTs produced by ANTLR can be faithfully converted to and from TermL term-trees. E thereby uses TermL as its universal AST notation.

Note: ANTLR's documentation may cause confusion here. It says that its ASTs are logically S-Expressions, and ANTLR's notation for their ASTs looks like S-Expressions. However, their ASTs are instead logically equivalent to mutable term-trees, and their S-Expression-like notation for these term-trees just adds to the confusion. See Differences from S-Expressions table below.

For each BNF grammar for parsing a sequential input language into a tree, one can derive a tree-BNF, or a grammar over trees, that describes the exactly possible trees that could result from parsing according to the original BNF. As with XML, such tree-grammars are known as schemas, and trees accepted by a given schema are valid with respect to that schema. Each schema reflects the infoset of a particular input language -- it shows what needs to be preserved after abstracting away from the sequential surface syntax. ANTLR supports tree-BNF directly. TermL does not currently make use of that support, but expects to in the future.

Again, as with S-Expressions or XML, TermL has a standard universal textual syntax -- universal in the sense that any TermL term-tree can be written in that notation. As with the others, this notation is designed for both human and machine readability. Unlike the other systems, but like other term-tree notations, TermL is designed to be familiar and natural for programmers coming to E from the C/Java/Python syntactic tradition. We expect to also define a universal binary notation for TermL that encodes exactly the same information, but in an encoding optimized only for automated use.

Syntax

This specification extends the specifications in Common Syntactic Elements.

Character classes

Name			Definition	Ascii subset
	segStart	::=	isJavaIdentifierStart	'a'..'z' \| 'A'..'Z' \| '_' \| '$'
	segPart	::=	isJavaIdentifierPart \| XML-NCNameChar	segStart \| digit10 \| '.' \| '-'

The non-ascii cases are not yet tested in the current TermL implementation.

Token Types

		Definition	Denotes
Token	::=	Integer \| Float64 \| Char \| String \| Tag \| anyof("(),[]{}:") / Chars
Float64	::=	Real64 \| # not yet implemented: '%NaN' \| '-'? '%Infinity'	A value representable in the Java subset of IEEE double precision.
Chars	::=	'\'' (charConst \| '\"' \| '\\' '\n' )+ '\''	A sequence of Unicode characters. 'xyz' is a shorthand for ('x', 'y', 'z')
ident	::=	segStart segPart*	Includes all Java and E identifiers and fully qualified names, and XML-NCNames.
special	::=	'.' ident	For keyword-like tags, such as `.char.` or `.bag.`.
uri	::=	'<' uric* '>'	Includes all URIs prior to "%" interpretation.
segment	::=	ident \| special \| uri
sos	::=	segment \| String # not yet implemented	A "segment or string".
Tag	::=	segment ('::' sos)* \| ('::' sos)+	Can also represent unresolved and resolved XML-QNames.

The ident production should accept all the keywords and identifiers allowed by most C or Algol tradition languages. This includes E -- identifiers in the E language are just like Java identifiers except they may not include a '$' (dollar sign). (E excludes the '$' for identifiers so the use of '$' in fully qualified names will be unambiguous.)

The special production allows keyword-like segments such as .bag. that can't conflict with the Java, E, or XML-NCNames, since none of these allow an initial '.'. By convention, the special names also end with a '.'. The special names are reserved for use by TermL and TermL related specifications. (Much as names beginning with "xml" are reserved for use by XML and XML-related specifications.)

In the uri production, any '\\' characters read in are converted to '/', and any '|' characters are converted to ':', so the TermL infoset does not include these as possible characters in a Tag.

Since the uri production by convention will often represent IETF-URIs, this sequence of uric characters may often contain sequences such as '%' digit16 digit16 in order to quote a non-uric character. The interpretation of such quotations must not be done by a TermL lexer and parser, but is instead left to the application if it wishes to do so. The TermL infoset includes such sequences only in their uninterpreted form.

Term Tree Kernel Grammar & Infoset

		Definition	Denotes
<term>	::=	<functor> '(' <argList> ')'	Labeled tree-node. A term-tree is a tree of such nodes.
<functor>	::=	Integer \| Float64 \| Char \| String \| Tag	Label on a tree-node. A functor's kind is a tag (see below).
<argList>	::=	( <term> (',' <term>)* )?	Sequence of ordered children of a tree-node.

Each functor has a kind, which is a tag. Tags are an enumeration of kinds -- each tag is its own kind. The four forms of literal data are each represented by their own kind -- the tags shown in the following table:

Token Type		Functor Kind
	Integer	.int.
	Float64	.float64.
	Char	.char.
	String	.String.
	Tag	Each tag's kind is itself

For a given application of TermL, such as representing ASTs from parsing a particular language, there will typically be an infinite number of possible functors, but a finite number of possible functor kinds. Each functor kind will typically correspond to an AST node type. A TermL schema (to be covered separately) should specify constraints in terms of functor kinds rather than particular functors.

In order to preserve one-to-one correspondence with ANTLR ASTs, and because it makes the TermL infoset simpler, the TermL spec allows non-tag functors with children. However, this is currently discouraged until a motivating use is found. Instead, terms with a non-empty list of children should generally have tag functors, and non-tag functors should generally be used only as leaves.

Syntactic Shorthands

	Expands to	Conventional Purpose
<functor>	<functor>()	Leaves are just childless nodes.
'[' <argList> ']'	.tuple.(<argList>)	For when a term is just a sequence of ordered children.
'{' <argList> '}'	.bag.(<argList>)	For when a term is just a bag of unordered children, ie, when the order is not significant.
<functor> '{' <argList> '}'	<functor>({<argList>}) # which further expands to <functor>(.bag.(<argList>))	For when a term is a labeled bag of unordered children.
<functor> ':' <term>	.attr.(<functor>(<term>))	For when the label on a singleton child is not about properties of the child, but about its relationship to the parent. See Quasi-JSON back from the dead (the section at the end of the linked page)
Chars	Char, ..., Char	'xyz' expands to ('x', 'y', 'z')

The first shorthand -- being able to omit an empty argument list -- is essential for human readability, and is supported by all term-tree languages. It allows us to write

add(2, 3)

rather than

add(2(), 3())

The rest support some standard conventions for representing concepts from more complex infosets using the simple concepts for TermL, as shown by the following example embeddings.

Unless stated otherwise, all text on this page which is either unattributed or by Mark S. Miller is hereby placed in the public domain.

/ data / terml

ELib E Language Smart Contracts Related

Download FAQ API Mail Archive Donate

report bug (including invalid html)