ERights Home elang / grammar 
Back to: Quasi-Literals and Term / Functor Trees On to: Methods and Matchers

and XML

Note: This document is now obsolete, but is being preserved as a record of our plans preceding our decision to use Antlr-based Term / Functor trees instead, as will be explained here.

XML's greatest strength is also its greatest weakness. XML standardizes a single universal meta-notation for expressing a vast variety of different types of semi-structured data. The advantage of this approach is that generic tools can be built that apply to all data expressed in this notation. The disadvantage of shoehoring all data into a one-size-fits-all notation is that this "universal" notation will be a very poor fit for many particular types of data. Historically, rather than having a universal notation, the world created a vast number of specialized notations, each optimized for its particular subject matter. This approach has exactly the opposite advantages and disadvantages of the XML approach.

This paper shows how, by leveraging E's QuasiParser Framework, we can have the best of both worlds while maintaining XML compatibility.

This paper describes a proposal for enhancing E, rather than features currently in E or decisions that have already been made. We welcome your feedback on the proposal so we may make better decisions prior to implementation.

Making XML Usable

We propose to use three technologies to improve the usability of XML:

  • A Parser Generator whose actions build XML/DOM trees.

    This technology creates notational interoperability.

  • A QuasiParser for creating quasi-literal XML expressions and quasi-literal XML patterns

    With this technology, programmers can write transformation code in the match-bind-substitute programming paradigm in the context of general programmability.

  • Adoption of Minimal-XML, at least to start with, and our own correspondingly minimal DOM-tree implementation for representing Minimal-XML parses. Minimal-XML is a downward compatible subset of XML, and (using the Minimizer) most of XML can be embedded in Minimal-XML.

Each of these technologies by themselves make XML significantly more usable. In combination, they amplify each other, enabling programmers greater ease at manipulating XML.

Notational Interoperability

Abstracting away from the particulars of XML syntax, what is the essential ideaof XML? XML expresses trees of symbols in a manner very similar (syntax aside) to Lisp S-Expressions or Prolog term-trees (sometimes called Herbrand trees). Lisp S-Expressions and Prolog term-trees are usually thought of primarily as data structures, and only secondarily as notations. What if we take this perspective to XML as well?

It turns out there already is a well standardized tree data structure that corresponds one-for-one with XML: the Document Object Model, or DOM. Since XML and DOM trees correspond exactly, we can, without loss of generality, build our tools to convert to and from DOM trees rather than XML notation. Now we can introduce specialized notations freely, as long as we simultaneously introduce a platform independent means of parsing these notations into DOM trees, and of flattening such DOM trees back into that specialized notation. In this context, XML notation is properly seen as simply another notation to convert to and from DOM trees. The distinguishing characteristic of XML remains only what we'd expect: its univerality. Only XML needs to be able to describe any DOM tree.

Since any of our other notations can now be converted to and from XML, XML can remain the universally understood interchange medium -- both to other tools and to other sites -- while still allowing programmers to manipulate specialized forms of data in notations suited for that form of data.

Minimal-XML is the subset of XML that most closely corresponds to the good stuff -- the S-Expression-like functionality.

Match-Bind-Substitute Programming

So how should DOM trees be examined and manipulated by programmers? Currently programmers are stuck between two bad choices:

  • They can program to the DOM API explicitly.

    Occasionally, this is the right thing. More often this results in expressing concepts far too indirectly. Think about the difference between driving vs. explaining to someone how to drive. This is like the difference between expressing mostly-literal data mostly literally vs. having to write code to generate all the data.

  • They can write transformations in XSL.

    XSL is a specialized language built specifically for transforming XML, into XML or other notations, but not for transforming other notations into XML. Most damaging, XSL is not Turing complete (does not have the power of any general purpose programming language), and so is severely restricted in the transformations it can express. Nevertheless, it is attractive for simple transformations precisely because of its support for match-bind-substitute programming.

What is the match-bind-substitute style, and how is it used in XSL? Consider the following snippet of XSL:

<xsl:template match="Sect1/Title">

The first line is a pattern that matches against an XML subtree exactly when its root is a Sect1 tag and it contains Title-tagged subtree as an immediate child. Besides matching, this pattern also binds the children of Title to an implicit variable used (implicitly again) in the next line. Though this pattern is performing a match-bind it does not look like the data it matches. In more complex XSL cases involving containment, the pattern does have a partially literal resemblance (in certain regards) to the data it matches against, so we would call this a quasi-literal pattern.

The second line shows the flip side, a quasi-literal expression. An expression evaluates to the data to be produced. A quasi-literal expression is an expression that looks partially like the data it will produce. In this case, the data to be produced is HTML bracketed by "<H2>" and "</H2>" tags just as they appear in the expression. The non-literal parts of the expression are those prefixed with "<xsl:". These are evaluated according to different rules and the results substituted in to the enclosing literal context. However, these extra rules are simply the primitives built into XSL, and despite the name, are not extensible.

The E quasi-parser framework combines the directness of XSL-style match-bind-substitute programming with the power of general purpose programming. The E programmer can escape the dichotomy presented by the above two bullets.

Example: MathML

MathML is a particularly notorious example of the price XML pays for using a single generic syntax for all jobs. In MathML, the simple expression

x2 + x3

is expressed as


This small example, though unpleasant, is still something humans can deal with. However, the problem gets so bad with large examples that it's recommended for humans to deal with equations only through graphical wysiwyg equation editors (such as WebEQ). That's fine for human editors, but what about human programmers?

Further, what about XML DTDs defined for narrower domains, without the broad interest that would produce specialized wysiwyg editors? The appendix below shows an American Option written in the Financial Products Markup Language, or FpML, the emerging standard for representing derivative intruments. A glance shows the notation to be at least as unwieldy as MathML, but FPML advocates offer no alternative to their notational burden. Increasingly, XML is being applied to non-document data. MathML is a representative small example of the issues that result.

Programmers need to write code that generates, recognizes, and transforms XML. For an example transformation that should be easy, consider symbolic differentiation of simple mathematical expressions, as is taught to undergraduates in introductory Lisp and Prolog courses. Since XML transformations are normally written in XSL, one would expect that symbolic differentiation could easily be expressed in XSL. Unfortunately, XSL, with its fixed set of primitives, runs out of steam too quickly. I was not able to figure out how to write it at all in XSL, and neither could other programmers I asked.

Turning to the other currently available alternative, let's see how symbolic differentiation looks in Java manipulating DOM trees. For compactness, we only show the rules necessary for differentiating the two operators in our example, addition and exponentiation-to-a-constant. symbolic differentiation in Java on DOM trees...

By contrast, using the XML QuasiParser technology described below, we can express the same transformation in E as

def deriv(expr, var) :any {
    switch(expr) {
        match math`$var ** @exp` ? (isConst(exp)) {
            math`$exp * $var ** ($exp - 1)`
        match math`@a + @b` {
            def da := deriv(a, var)
            def db := deriv(b, var)
            math`$da + $db`

The last 3 lines could have instead been written as

            math`${deriv(a, var)} + $(deriv(b, var)}`

depending on your taste. This is not only compact, readable, and maintainable, it also states the transformation rules in a notation recognizable to the relevant subject domain specialists -- in this case, to mathematicians. Despite the notational shift, the output is a DOM tree which can be written out as a new MathML expression in XML. What do we need to build to make this example possible? Below, we build up to this example in stages.

Usable XML is Three Stages

First we explain the existing QuasiParser framework already integrated into E. Then we explain how we will extend this, in three stages, into a system able to express the above examples.

  1. An XML QuasiParser

    Enables programmers to write transformation code in the match-bind-substitute style, a style whose power has been amply demonstrated by XSL, Prolog, and Perl.

  2. Parser Generation with XML Actions

    Enables programmers to create specialized notations for expressing data in particular DTDs, while maintaining XML compatibility.

  3. Generating QuasiParsers Instead

    Enables the application of these new notations to the match-bind-substitute programming of XML transformations.

Step 1: An XML QuasiParser

To be written...

Example: Symbolic Differentiation in XML Notation

Armed only with the XML QuasiParser itself, we can express XML transformations using quasi-literal XML notation. For example, our symbolic differentiation routine could now be written:

def deriv(expr, var) :any {
    switch(expr) {
        match xml`<apply>
                  </apply>` ? (isConst(exp)) {
        match xml`<apply>
                  </apply>` {
            def da := deriv(a, var)
            def db := deriv(b, var)

Step 2: Parser Generation with XML Actions

To be written...

Example: Defining DTD-Specific Grammars

The following grammar is written in a hypothetical variant of BYacc/Java or ANTLR in which the actions are XML quasi-literal strings as supported by Step 1 above. (We would probably create an ANTLR wrapper.)

 |    a:expr '+' b:mult          { <apply><plus/>$a$b</apply> }
 |    a:expr '-' b:mult          { <apply><minus/>$a$b</apply> }

 |    a:mult '*' b:pow           { <apply><times/>$a$b</apply> }
 |    a:mult '/' b:pow           { <apply><divide/>$a$b</apply> }

 |    a:prim '**' b:prim         { <apply><power/>$a$b</apply> }

If the generated parser is called Math, then executing

Math("x**2 + x**3")

produces the same DOM tree that would be produced by parsing the corresponding XML. From this DOM tree, the corresponding XML can be trivially output.

But a specialized data set isn't an island. A document should properly be able to contain multiple different kinds of data. We would like to express each kind of data in the notation most suitable for it, while allowing these data items to be embedded in each other. Our XML QuasiParser already solves this for us. The expression

xml`<Sect1><Para>For example, ${Math("x**2 + x**3")} is nice</Para></Sect1>`

Produced the same DOM tree as this XML:

  <Para>For example, <apply>
    </apply>is nice</Para>

Step 3: Generating QuasiParsers Instead

To be written...


The code in the Appendix below is a typical sample use of FpML, the Financial Products Markup Language, to represent a derivative -- in this case, an American Call Option. This code and others like it are posted by FpML advocates, so one may assume that it represents good FpML usage.

FpML is both an important standard in its own right, and representative of what happens when XML is applied to semi-structured data outside the "document" paradigm. No where else in the history of computation has such a great excess of syntax over semantics been long sustained. XML programmers desperately need notational relief! The E QuasiParser Framework can provide much of that relief, while maintaining compatibility .

Appendix: An American Call Option in FpML

The following code is covered by the Mozilla open-source license.

<?xml version="1.0" standalone="no"?>
<!-- Copyright (c) 1999 by J.P.Morgan and PricewaterhouseCoopers.
PricewaterhouseCoopers refers to the individual member firms of the world
wide PricewaterhouseCoopers organization.  All rights reserved.
<!-- version 1.0b2 : August 6, 1999 -->

<!DOCTYPE fpml:FpML SYSTEM "fpml.dtd" >
<fpml:FpML xmlns:fpml="urn:fpml-FpML"
        xmlns:r="urn:fpml-rate" >
      <tid:TradeIDs xmlns:tid="urn:fpml-TradeID">
          <tid:partyReference>XYZ Investement Bank</tid:partyReference>
          <tid:partyReference>ABC Investments</tid:partyReference>
      <fpvo:FXVanillaOption xmlns:fpvo="urn:fpml-FX-Product-VanillaOption">
                <fxos:buyerReference>XYZ Investement Bank
                <fxos:sellerReference>ABC Investments
                    <fxos:cutoffTime>10:00 AM</fxos:cutoffTime>
                <!-- premiumterm could be
                     ccy1/ccy2, ccy2/ccy1, %ccy1, %ccy2 -->
      <pty:PartyInformation xmlns:pty="urn:fpml-party"
          <pty:Party name="ABC_Trust">
                <pty:shortName>ABC Investments</pty:shortName>
                    <a:contactName>Jacob Smith</a:contactName>
                    <a:contactOrganizationName>ABC Investments
                            <a:streetLine>123, Park Avenue</a:streetLine>
                        <a:city>New York</a:city>
                    <a:contactName>Jacob Smith</a:contactName>
                    <a:contactOrganizationName>Trade Settlements
                    <pty:payFromName>ABC Bank PLC</pty:payFromName>
                        <pty:fiName>ABC Bank PLC</pty:fiName>
                        <pty:fiName>XYZ Bank PLC</pty:fiName>
                        <pty:fiName>ABC Brokers</pty:fiName>
                          For the account of Harry Smith-/883733333
                Also expected is the full settlement instructions for:
                <pty:partyReferenceName>XYZ Investment bank
Unless stated otherwise, all text on this page which is either unattributed or by Mark S. Miller is hereby placed in the public domain.
ERights Home elang / grammar 
Back to: Quasi-Literals and Term / Functor Trees On to: Methods and Matchers
Download    FAQ    API    Mail Archive    Donate

report bug (including invalid html)

Golden Key Campaign Blue Ribbon Campaign