
The Power of Irrelevance:
Designing notations to support adversarial reviews


From http://www.eros-os.org/pipermail/e-lang/2004-February/009526.html:

At 11:08 PM 1/31/2004 Saturday, Dean Tribble wrote:
>[...]
>I presume that requiring parens on the argument to 'match' would introduce
>other problems (in addition to actually violating backwards compatibility).

Yes. The general style followed by most of E, after a clause-introducing keyword, is that parens surround expressions, not patterns:

escape pattern { ...
} catch pattern { ...
while (expr) { ...
if (expr) { ... 

The other major style is function / method declaration, in which we have an identifier followed by a list of patterns within parens:

to ident(pattern,...) ... {
def ident(pattern,...) ... {
when (expr,...) -> ident(pattern) ... {

Our glaring inconsistency is that we don't require parens around the collection expression in a for-loop. We accept:

for pattern => pattern in expr { ...

and

for pattern in expr { ...

whereas, if we were consistent, we'd require

for pattern => pattern in (expr) { ...

and

for pattern in (expr) { ...

What's cool about this convention is that it enables a human reader to perform scope analysis on the pre-expansion text by following a few simple rules. Why is this important?

The Power of Irrelevance:

On the design of notation to support the review process

One of the lessons driven home by our Xanadu experience: any data structure used to answer queries over a huge amount of data must first of all provide the property of quick reject: it must quickly disqualify most of the data it holds as irrelevant to the query, so that it can afford to do more subtle calculations on what remains. A good example is rectangular bounding boxes around complex shapes in a graphics app, letting us quickly reject all shapes that don't come close to intersecting a ray. We can then do a more subtle calculation on the remaining candidates, to see whether they do intersect.
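
To make quick reject concrete, here is a minimal C++ sketch (all names hypothetical, and a point-hit query standing in for the ray test). The cheap box test disqualifies most shapes as irrelevant; the subtler exact test runs only on the survivors:

#include <vector>

// Circles carrying axis-aligned bounding boxes.
struct Circle { double cx, cy, r; };
struct BBox   { double xmin, ymin, xmax, ymax; };

BBox boundsOf(const Circle& c) {
    return { c.cx - c.r, c.cy - c.r, c.cx + c.r, c.cy + c.r };
}

// Quick reject first, then the subtle calculation on what remains.
bool anyHit(const std::vector<Circle>& shapes, double x, double y) {
    for (const Circle& c : shapes) {
        const BBox b = boundsOf(c);
        if (x < b.xmin || x > b.xmax || y < b.ymin || y > b.ymax)
            continue;                        // quickly rejected as irrelevant
        const double dx = x - c.cx, dy = y - c.cy;
        if (dx * dx + dy * dy <= c.r * c.r)  // exact test on a survivor
            return true;
    }
    return false;
}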

In thinking about lessons from the security review you and David put us through ;), I came to realize that much of its success was due to E's support for a similar quick reject property. Given a question and a large body of largely unfamiliar code, the first job of a programming language notation is to help human readers (both with and without IDE help) do a quick reject: to disqualify most of the program as being irrelevant to the current question without needing to look at most of it. (See also Visible Workings.)

Scope analysis is the reader's main tool for quickly determining, when looking at a program fragment, which things cannot influence which other things. Scope analysis gives a first conservative bound for any influence analysis.

This explanation helps explain what's wrong with Smalltalk-72, all macro systems I know about 1 2, dynamic scoping, ambient authority, Pascal's var parameters, and C++'s references. When looking at a call site (or something that locally looks like a call site), can the reader tell, without knowledge of the thing apparently being called, which things of the caller are being made accessible to the callee? All of the above language elements disrupt that analysis. For example, in E or C-without-macros, in the call

foo(x)

without knowing foo, we still know that the callee is being given access to x's value, but isn't given the ability to assign to the x variable. If the E or C-without-macro programmer wishes to give foo the ability to change x's value, they need to write

foo(&x)

which makes this extra access visible to readers who don't know foo. Pascal and C++ don't have this property: unless you know how foo declares the corresponding parameter, you can't tell what the caller has given the callee the ability to do. These languages tend towards the failure mode of which Smalltalk-72 was the extreme: you can't understand anything about the program until you understand everything.
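
Here is a minimal C++ sketch of the contrast (names hypothetical). The by-value and by-pointer calls keep the caller's grant visible at the call site; the reference parameter hides it:

#include <iostream>

void byValue(int x)      { x += 1; }   // cannot affect the caller's variable
void byPointer(int* x)   { *x += 1; }  // caller must write &x to grant this
void byReference(int& x) { x += 1; }   // the grant is invisible at the call site

int main() {
    int x = 0;
    byValue(x);      // reader knows x is safe without looking up byValue
    byPointer(&x);   // the &x advertises the extra access to every reader
    byReference(x);  // looks exactly like byValue(x), yet x changes
    std::cout << x << "\n";  // prints 2
}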

In Smalltalk-76, Smalltalk-80, Squeak, ML, Kernel-E, or Scheme-without-macros, the program is written in a simple language with very few scoping rules, where those rules reliably allow the reader, from a program fragment, to use scope analysis to disqualify much of the program from relevance to a question, without looking up the definition of things mentioned at call sites. This property is essential for secure code, but is also important in general.

Clearly, all macro systems prior to Scheme's invention of the notion of hygienic macros (who invented this notion?) fatally disrupt this ability. Scope analysis was only done after macro expansion, so the reader would have to simulate the expansion in their head before they could use scope analysis. This requires understanding which calls are actually macro calls, and it requires looking up the definitions of all those macros and understanding them.
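
The C preprocessor gives a compact illustration (a hypothetical sketch, not from the original message). A macro's private variable can capture the caller's, and nothing at the call site warns the reader:

#include <iostream>

// An unhygienic macro: its private "tmp" can capture a caller's
// variable of the same name.
#define SWAP(a, b) do { int tmp = (a); (a) = (b); (b) = tmp; } while (0)

int main() {
    int tmp = 1, other = 2;
    // Expands so that the macro's inner tmp shadows the caller's tmp,
    // and the swap silently fails. Scope analysis of the unexpanded
    // text cannot reveal this; the reader must simulate the expansion.
    SWAP(tmp, other);
    std::cout << tmp << " " << other << "\n";  // prints "1 2", not "2 1"
}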

What about hygienic macros themselves? Hygiene points the way towards sense, but doesn't go far enough to achieve the above goal. The reader of a fragment of a Scheme program that might be using hygienic macros must still do everything listed above before scope analysis allows them to disqualify any of the program from relevance to their question.

In the absence of macro expansion, we can think of the normal analysis of code as proceeding by the following steps:

  • 1. Reading sequences of bits to recognize characters. (With UTF-8, this step isn't as trivial as it once was.)
  • 2. Lexing: Reading sequences of characters to recognize tokens.
  • 3. Parsing: Reading sequences of tokens to recognize BNF productions, producing ASTs.
  • 4. Scope Analysis: Walking ASTs to match use and defining occurrences of identifiers, hopefully by simple lexical scoping rules.

I believe the above sequence works so well because it has some kind of psychological match to what kinds of recognition tasks humans perform well when looking at program text. I won't try to defend this claim, but I do assume it. From my own introspection it seems right. Yes, I know introspection is a terrible guide, but it's the best I've got.

In any case, we categorize macro systems according to where they intervene in the above sequence:

  • 1.5. The pre-ANSI C preprocessor operated on character sequences prior to lexing.
  • 2.5. The ANSI C preprocessor operates on the token stream between lexing and parsing.
  • 3.5. Non-hygienic Lisp macro systems operated on the S-Expression (Lisp's AST) between parsing and scope analysis.
  • 4. Scheme's hygienic macros operate after parsing, interleaved with scope analysis.

I believe most would agree that each step along this sequence has led to an increase in quality, from a software engineering standpoint. I claim that the reason is that each step has enabled a reader to understand more about a program fragment prior to simulating the expansion of the macros. The E-to-Kernel-E expansion comes close to providing the following property; where E fails to provide it, that should be considered a bug to be fixed. Should E ever allow user-defined macros, these must preserve the property as well. The productions reserved for macros in the E grammar are indeed able to support a macro system with this property.

  • 4.5. Expansion happens after scope analysis. Scope analysis therefore happens on the pre-expansion text, without reference to the definition of the macros being invoked. (Of course, it need not be implemented this way, as long as the effect is equivalent.)

I claim this would be better by direct extrapolation of the above argument.

However, we also wish the results of expansion to be a program in the language, so that we may use the same notation to explain the results of expansion. This leads to the following rule:

Imagine that we printed the expanded program (in our case, the resulting Kernel-E program) as the source text of a program in the source language (in our case, an E program). If a scope analysis of the expanded program would yield different bindings than the pre-expansion scope analysis, statically reject the program.

Here's an example where earlier E implementations failed to obey this rule:

? pragma.syntax("0.8")

? def i := 3
# value: 3

(The following test case no longer describes how E actually behaves, and so it is protected by an extra column of "#" so that it won't cause a failing updoc test.)

#? for i in (0..i) { println(i) }
## stdout: 0
##         1
##         2
##         3
##

Surprised? You should be. By the normal E rules, which the E reader can rely on almost all the time, an identifier comes into scope at its defining occurrence and remains in scope, proceeding left to right, until the end of its containing scope box, except as shadowed by definitions of the same identifier in contained scope boxes. A scope box starts at the keyword introducing a clause (here, for) and lasts until the corresponding close curly (see http://www.erights.org/elang/blocks/ifExpr.html).

By this rule, the reader would be justified in expecting the i in (0..i) to refer to the i of for i in. But the earlier E implementation's scope analysis, like that of pre-hygienic Lisps, operated only after expansion; and the for-loop expansion reverses the order in which these two parts occur, as well as placing the iteration variable i in a nested scope box. As a result, the i in (0..i) referred to the def i, whereas the i in println(i) referred to the iteration variable i, which shadows the i of def i.

? for i in (0..i) { println(i) }
# syntax error: Use on right isn't really in scope of definition: ["i"]

What E now does is statically reject the above program. This new behavior may give the program's author an unpleasant static surprise. But the inability to author this kind of obfuscated program will prevent nasty surprises to readers, which is vastly more important: even the programs I author, I read much more often than I author them. Legal programs must fulfill the expectations both of readers who are thinking about the pre-expanded text and of readers who are simulating the expansion in their head.

What does this have to do with parens around patterns? At the head of a scope box, i.e., the part between the keyword introducing the clause and the open curly, when we see an identifier, in order to do scope analysis by the above rule, we need to be able to tell whether it's a use occurrence or a defining occurrence. To determine this, you need to know whether it's in an expression, a pattern, or something else. Given that we can ask the reader to memorize a small fixed set of "something else" cases (e.g., identifier(patterns,...) after a to), the only remaining rule is: if it's surrounded with parens, it's an expression; if not, it's a pattern.

Should we ever allow user-defined macros, they would be constrained to play by this rule, enabling pre-expansion scope analysis to proceed using the left-to-right rule. As with the for-loop, rather than constrain macros only to produce expansions that preserve the scope analysis, we can instead reject just those programs that cause an expansion that would disrupt the pre-expansion scope analysis.

Other known occurrences of this bug in the current E implementation:

  • Dean has objected that the current expansions of && and || can disrupt a pre-expansion scope analysis. I agree: the current expansions of these must be considered buggy on these grounds. Fortunately, an alternate expansion can fix this in all cases, never requiring a surprising static rejection of a program.

  • In the expansion of, for example, the quasi-pattern `-$i-@i-@j-$j-`, the relative order among the embedded dollar-hole expressions is preserved, and the relative order among the at-hole patterns is preserved, but all the extracted expressions occur in the expansion to the left of all the extracted patterns. As one would expect from the pre-expansion text, $i refers to the i already in scope, not the one defined by @i. But $j also refers to the j already in scope, not the j defined by @j. This violates the left-to-right rule, so E must instead statically reject the program, rather than allow a reader to be confused about the meaning of $j.


1 JSE, the Java Syntactic Extender, is a possible exception. It is clearly a fellow traveller, but I don't yet understand it well enough to know whether it has the property we advocate here.

2 Since writing the original email, Oleg Kiselyov has pointed out the large and wonderful world of "multi-stage programming", much of which takes the above perspective to an even greater extreme: it moves macro expansion after both scope and type analysis, statically ensuring that dynamically generated expansions are type safe.


An attempt to backdoor the Linux kernel is, of course, the classic story of an open review process preventing a security hole, even in a language (C) in which it is too easy to write malicious code that looks innocent.

Oleg Kiselyov has also pointed out the wonderful/terrible story of the boy who ended up with a police record because he accidentally put an extra pair of parens in a Perl program. This is an example of the kind of accident we hope to make less likely.


Acknowledgements

I'd like to thank Dave Liebs for prodding me into writing this.

 
Unless stated otherwise, all text on this page which is either unattributed or by Mark S. Miller is hereby placed in the public domain.