ERights Home elang / quasi / xml 
No Previous Sibling No Next Sibling

XML Encoded in Term Trees

The current version of this page has moved here.

The text below remains, for the moment, for historical interest.

This document and design is directly and shamelessly derived from SXML revision 2.1 by Oleg Kiselyov. SmallML is to Term Trees and E as SXML is to S-expressions and Scheme. SmallML itself is by Mark Miller

This page specifies revision 2.1 of SmallML. (We hope to track further revisions of SXML, but retain the right to diverge.) SmallML is an abstract syntax tree of an XML document. SmallML is also a concrete representation of the XML Infoset in the form of Term Trees. The generic tree structure of SmallML lends itself to a compact library of combinators for querying and transforming SmallML.

  1. Introduction
  2. Notation
  3. Grammar
  4. SmallML Tree
  5. Namespaces
  6. Normalized SmallML
  7. Examples
  8. Acknowledgment
  9. References


An XML document is essentially a tree structure. The start and the end tags of the root element enclose the whole content of the document, which may include other elements or arbitrary character data. Text with familiar angular brackets is an external representation of an XML document. Applications ought to deal with its internalized form: XML information set, or its specializations. This form lets an application locate specific data or transform an XML tree into another tree, which can then be written out as an XML, HTML, PDF, etc. document.

XML information set (Infoset) [XML Infoset] is an abstract data set that describes information available in a well-formed XML document. Infoset is made of "information items", which denote elements, attributes, character data, processing instructions, and other components of the document. Each information item has a number of associated properties, e.g., name, namespace URI. Some properties -- for example, 'children' and 'attributes' -- are (ordered) collections of other information items. Infoset describes only the information in an XML document that is relevant to applications. The default value of attributes declared in the DTD, parameter entities, the order of attributes within a start-tag, and other data used merely for parsing or validation are not included. Although technically Infoset is specified for XML it largely applies to HTML as well.

The hierarchy of containers comprised of text strings and other containers greatly lends itself to be described by Term Trees. Term Trees are easy to parse into an internal representation suitable for traversal. They have a simple external notation (albeit with many a parentheses), which is relatively easy to compose even by hand.

SmallML is a concrete instance of the XML Infoset. Infoset's goal is to present in some form all relevant pieces of data and their abstract, container-slot relationships to each other. SmallML gives the nest of containers a concrete implementation as Term Trees, and provides means of accessing items and their properties. SmallML is a "relative" of XPath [XPath] and DOM [DOM], whose data models are two other instances of the XML Infoset. SmallML is particularly suitable for E-based XML/HTML authoring, XPath-like queries, and tree transformations. In John Hughes' terminology, SmallML is a term implementation of evaluation of the XML document.



We will use an Extended BNF Notation (EBNF) employed in the XML Recommendation [XML]. The following table summarizes the notation.

An optional thing
Zero or more things
One or more things
thing1 | thing2 | thing3
Choice of things
thing1, thing2, thing3
Sequence of things
A non-terminal of a grammar
A terminal of the grammar that is a tag
A terminal of the grammar that is a string
A literal tag
<A> ( <B>* )
A Term Tree whose functor is <A> and whose arguments are zero or more <B>s
A symbol whose string representation consists of all characters that spell <A> followed by the colon character and by the characters that spell <B> . The MAKE-SYMBOL() notation can be regarded a meta-function that creates symbols.


[1]  <TOP> ::= %TOP% ( <namespaces>?, <PI>*, <comment>*, <Element> )

This Term Tree stands for the root of the SmallML tree, a document information item of the Infoset. It contains the root element of the XML document as its only child element.

[2]  <Element> ::= <name> ( <attributes-list>?, <namespaces>?, <child-of-element>* )
[3]  <attributes-list> ::= %bag% ( <attribute>* )
[4]  <attribute> ::= <name> ( "value"? )
[5]  <child-of-element> ::= <Element> | "character data" | <PI> | <comment> | <entity>

These are the basic constructs of SmallML.

The syntactic shorthands provided by Term Tree notation allows several of these to be expressed in a more compact fashion:

[2'] <name>  expands to  <name>()
[2''] <name> { <attribute>* }   expands to   <name> ({ <attribute>* })
[2'''] which expands to   <name> ( %bag% ( <attribute>* ))
[3'] { <attribute>* }   expands to   %bag% ( <attribute>* )
[4'] <name> = "value"   expands to   <name> ( "value" )

We will use these shorthands freely in the rest of the document.

[6]  <PI> ::= %PI% ( pi-target, "processing instruction content string" )

The XML Recommendation specifies that processing instructions (PI) are distinct from elements and character data; processing instructions must be passed through to applications. In SmallML, PIs are therefore represented by nodes of a dedicated type %PI% . DOM Level 2 treats processing instructions in a similar way.

[7]  <comment> ::= %COMMENT% ( "comment string" )
[8]  <entity> ::= %ENTITY% ( "public-id", "system-id" )

Comments are mentioned for completeness only. A SAX-like XML parser [SSAX], among others, should transparently skip the comments. The XML Recommendation permits the parser to pass the comments to an application or to completely disregard them. The present SmallML grammar admits comment nodes but does not mandate them by any means.

An <entity> node represents a reference to an unexpanded external entity. This node corresponds to an unexpanded entity reference information item, defined in Section 2.5 of [XML Infoset]. Internal parsed entities are always expanded by the XML processor at the point of their reference in the body of the document.

[9]  <name> ::= <LocalName> | <ExpName>
[10]  <LocalName> ::= NCName
[11]  <ExpName> ::= MAKE-SYMBOL(<namespace-id>:<LocalName>)
[12]  <namespace-id> ::= URI | user-ns-shortcut
[13]  <namespaces> ::= %NAMESPACES% ( <namespace-assoc>* )
[14]  <namespace-assoc> ::= <namespace-id> = "URI"

An SmallML <name> is a single symbol. It is generally an expanded name [XML Namespaces], which conceptually consists of a local name and a namespace URI. The latter part may be empty, in which case <name> is a NCName: a tag whose spelling conforms to production [4] of the XML Namespaces Recommendation [XML Namespaces]. <ExpName> is also a tag, whose string representation contains an embedded colon that joins a local and a namespace parts of the name. A URI is a Namespace URI string converted to a tag. A user-ns-shortcut is an NCName chosen by the user to represent a namespace URI in the application program. The SAX-like parser should offer a user to define (short and mnemonic) unique shortcuts for Namespace URIs, which are are often long and unwieldy strings.

[2e]  <Element> ::= <name> ( <attributes-list>?, <aux>?, <child-of-element>* )
[4e]  <attribute> ::= <name> ( "value"?, <aux>? )
[15e]  <aux> ::= %aux% ( <namespaces>?, <aux-node>* )
[16e]  <aux-node> ::=   To be defined in the future

The XML Recommendation and related standards are not firmly fixed, as the long list of errata and the proposed version 1.1 of XML clearly show. Therefore, SmallML has to be able to accommodate future changes while guaranteeing backwards compatibility. SmallML also ought to permit applications to store various processing information (e.g., cached resolved IDREFs) in an SmallML tree. To allow such extensibility, we introduce two new node types, <aux> and <aux-node> . The semantics of the latter is to be established in future versions of SmallML. Other candidates for <aux-node> are the unique id of an element or the reference to element's parent. The structure and the semantics of the <aux> node is similar to those of the attribute list. Applications that do not specifically look for auxiliary nodes can transparently ignore any present and future extensions.


SmallML Tree

Infoset's information item is a sum of its properties. This makes a Term a particularly suitable data structure to represent an item. The functor of the list, a tag names the item. For many items this is their (expanded) name. For an information item that denotes an XML element, the corresponding Term's functor is the element's expanded name, and the arguments optionally begin with collections of attributes and effective namespaces. The rest of the arguments is an ordered sequence of element's children -- character data, processing instructions, and other elements. Every child is unique; items never share their children even if the latter have the identical content.

Just as XPath does and the Infoset specification explicitly allows, we group character information items into maximal text strings. The value of an attribute is normally a string; it may be omitted (in case of HTML) for a boolean attribute, e.g., <option checked> .

We consider a collection of attributes an information item in its own right, tagged with a special name %bag% . The tag '%bag%' may not occur in a valid XML name; therefore an <attributes-list> cannot be mistaken for a list that represents an element. An XML document renders attributes, processing instructions, namespace specifications and other meta-data differently from the element markup. In contrast, SmallML represents element content and meta-data uniformly -- as tagged lists. SmallML takes advantage of the fact that every XML name is also a valid tag, but not every tag is a valid XML name. This observation lets us introduce administrative names such as %bag% , %PI% , %NAMESPACES% without worrying about potential name clashes. The observation also makes the relationship between XML and SmallML well-defined. An XML document converted to SmallML can be reconstructed into an equivalent (in terms of the Infoset) XML document. Moreover, due to the implementation freedom given by the Infoset specification, SmallML itself is an instance of the Infoset.

Since an SmallML document is essentially a tree structure, the SmallML grammar of Section 3 can be presented in the following, more uniform form:

[N]  <Node> ::= <Element> | <attributes-list> | <attribute> | "character data" | <namespaces> | <TOP> | <PI> | <comment> | <entity> | <aux>

or as a set of two mutually-recursive datatypes, Node and Nodeset, where the latter is an ordered set of Nodes:

[N1]  <Node> ::= <name> ( <Nodeset> ) | "text string"
[N2]  <Nodeset> ::= ( <Node>* )
[N3]  <name> ::= <LocalName> | <ExpName> | %bag% | %TOP% | %PI% | %COMMENT% | %ENTITY% | %NAMESPACES% | %aux%

The uniformity of the SmallML representation for elements, attributes, and processing instructions simplifies queries and transformations. In our formulation, attributes and processing instructions look like regular elements with a distinguished name. Therefore, query and transformation functions dedicated to attributes become redundant.

The xml__quasiParser can convert an XML document or a well-formed part of it into the corresponding SmallML form. The parser supports namespaces, character and parsed entities, attribute value normalization, processing instructions and CDATA sections.



The motivation for XML Namespaces is explained in an excellent article by James Clark [Clark1999]. He says in part:

The XML Namespaces Recommendation tries to improve this situation by extending the data model to allow element type names and attribute names to be qualified with a URI. Thus a document that describes parts of cars can use part qualified by one URI; and a document that describes parts of books can use part qualified by another URI. I'll call the combination of a local name and a qualifying URI a universal name. The role of the URI in a universal name is purely to allow applications to recognize the name. There are no guarantees about the resource identified by the URI. The XML Namespaces Recommendation does not require element type names and attribute names to be universal names; they are also allowed to be local names.
The XML Namespaces Recommendation expresses universal names in an indirect way that is compatible with XML 1.0. In effect the XML Namespaces Recommendation defines a mapping from an XML 1.0 tree where element type names and attribute names are local names into a tree where element type names and attribute names can be universal names. The mapping is based on the idea of a prefix. If an element type name or attribute name contains a colon, then the mapping treats the part of the name before the colon as a prefix, and the part of the name after the colon as the local name. A prefix foo refers to the URI specified in the value of the xmlns:foo attribute. So, for example
     <cars:part xmlns:cars=''/>
maps to
Note that the xmlns:cars attribute has been removed by the mapping.
Using James Clark's terminology, SmallML as defined by [N1] is precisely that tree where element type names and attribute names can be universal names. According to productions [N3] and [9-12], a universal name, <name> , is either a local name or an expanded name. Both kinds of names are tags. A local name has no colon characters in its spelling. An expanded name is spelled with at least one colon, which may make the identifier look rather odd. In SmallML, James Clark's example will appear as follows:
or, somewhat redundantly,

Such a representation also agrees with the Namespaces Recommendation [XML Namespaces], which says: "Note that the prefix functions only as a placeholder for a namespace name. Applications should use the namespace name, not the prefix, in constructing names whose scope extends beyond the containing document."

It is clearly unwieldy to deal with tags such as <>:part . Therefore, an application that invokes a SAX-like parser may tell the parser to map the URI to an application-specific namespace shortcut user-ns-shortcut , e.g., c . The parser will then produce
To be more precise, the parser will return just
If an application told the parser how to map , the application can keep this mapping in its mind and will not need additional reminders.

We must note there is a 1-to-1 correspondence between user-ns-shortcut s and the corresponding namespace URIs. This is generally not true for XML namespace prefixes and namespace URIs. A user-ns-shortcut uniquely represents the corresponding namespace URI within the document, but an XML namespace prefix does not. For example, different XML prefixes may specify the same namespace URI; XML namespace prefixes may be redefined in children elements. The other difference between user-ns-shortcut s and XML namespace prefixes is that the latter are at the whims of the author of the document whereas the namespace shortcuts are defined by an XML processing application.


Normalized SmallML

Normalized SmallML is a proper subset of SmallML that permits efficient processing. An SmallML document in the normalized form must satisfy a number of additional restrictions. The first restriction makes <attributes-list> mandatory (cf. Section 3, production [2]). If an element has no attributes, <attributes-list> shall be specified as %bag%. In normalized SmallML, <comment> and <entity> nodes must be absent. Parsed entities should be expanded, even if they are external. A node <namespaces> may appear only in a %TOP% element.

SGML provides two equal forms for boolean attributes: minimized, e.g., <OPTION checked> and full, <OPTION checked="checked"> . XML mandates the full form only, whereas HTML allows both, preferring the former. SmallML supports the minimized form along with the full one: OPTION{checked} and OPTION{checked="checked"} . Normalized SmallML however accepts only the full form.


Simple examples:
     some-name           # An empty element without attributes
     some-name(%bag%)    # The same but in the normalized form

The figure below shows progressively more complex examples.

     <WEIGHT unit="pound">
       <NET certified="certified">67</NET>
            NET({certified="certified"}, 67),
     <![CDATA[<BR>]]]]>&gt; </P>
     %TOP%(P("<BR>\n<![CDATA[<BR>]]> "))

 An example from the XML Namespaces Recommendation
     <!-- initially, the default
          namespace is 'books' -->
     <book xmlns=''
       <title>Cheaper by the Dozen</title>
       <!-- make HTML the default namespace
            for some commentary -->
          <p xmlns='urn:w3-org-ns:HTML'>
            This is a <i>funny</i> book!
           "Cheaper by the Dozen"),
             "This is a ",

 Another example from the XML Namespaces Recommendation
     <NAME HTML:CLASS="largeSansSerif">
         Layman, A</NAME>
     <SEAT CLASS='Y'
     <HTML:A HREF='/cgi-bin/ResStatus'>
         Check Status</HTML:A>
              "Layman, A"),
                "Check Status"),



Many of the ideas in SmallML are due to E. Dean Tribble and Tyler Close.

This document and design is directly an shamelessly derived from SXML revision 2.1 by Oleg Kiselyov. The following are his acknowledgements on the original:

Discussions with Kirill Lisovsky of MISA University are gratefully acknowledged. He shares the credit for this page. The errors are all mine.



[McCarthy] John McCarthy. Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I. Comm. ACM, 3(4):184-195, April 1960.

[Clark1999] Jim Clark. XML Namespaces. February 4, 1999.

[R5RS] R. Kelsey, W. Clinger, J. Rees (eds.), Revised5 Report on the Algorithmic Language Scheme. Higher-Order and Symbolic Computation, Vol. 11, No. 1, September, 1998 and ACM SIGPLAN Notices, Vol. 33, No. 9, October, 1998.

[SSAX] Oleg Kiselyov. Functional XML parsing framework: SAX/DOM and SXML parsers with support for XML Namespaces and validation. September 5, 2001.

[SXML-short-paper] Oleg Kiselyov. XML and Scheme. An introduction to SXML and SXPath; illustration of SXPath expressiveness and comparison with XPath. September 17, 2000.

[Lisovsky] Kirill Lisovsky. Case sensitivity of Scheme systems.

[DOM] World Wide Web Consortium. Document Object Model (DOM) Level 1 Specification. W3C Recommendation.

[XML] World Wide Web Consortium. Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation. October 6, 2000.

[XML Infoset] World Wide Web Consortium. XML Information Set. W3C Recommendation. 24 October 2001.

[XML Namespaces] World Wide Web Consortium. Namespaces in XML. W3C Recommendation. January 14, 1999.

[XPath] World Wide Web Consortium. XML Path Language (XPath). Version 1.0. W3C Recommendation. November 16, 1999.


Unless stated otherwise, all text on this page which is either unattributed or by Mark S. Miller is hereby placed in the public domain.
ERights Home elang / quasi / xml 
No Previous Sibling No Next Sibling
Download    FAQ    API    Mail Archive    Donate

report bug (including invalid html)

Golden Key Campaign Blue Ribbon Campaign