Lessons of Deconstruction SerializationThe body of each chapter is presented using running code examples, and with detailed enough explanation that you should be able to follow this code. For those that wish only to understand the general lessons and how they may be applied to more conventional serialization systems, see the "Lessons of..." and "Corresponding Concepts in Convention Serialization" sections at the beginning and end of each chapter The "Deconstruction Serialization" chapter defines our serialization format, Data-E, by subsetting a programming language, E. This move is inessential, but it allows us to more easily understand what's happening. Many existing systems (Lisps, Smalltalk) often print an object as an expression that, if evaluated, would reconstruct that object. We can understand the data by reusing a subset of our code reading skills. When these printing systems do so reliably, they are often grown into serialization systems, in which the depiction actually is the program it appears to be. We can then understand the semantics of the serialized form by reusing a subset of our understanding of the semantics of the programming language. (Serialization in Mozart creates the equivalent of a compiled module, providing the semantic correspondence but not the visibility. Perhaps these can be decompiled.) In these terms, Java has three very different depiction systems.
To produce a depiction of an object graph, a serializer must somehow obtain a representation of each object adequate for calculating an overall depiction. In the printing frameworks, like Java's toString(), the representation offered is a depiction -- it is already only bits, and each object is responsible for traversing the portion of the subgraph rooted in itself. This is maximally flexible -- it allows an object to claim anything it likes. Used for serialization, this flexibility has fatal security problems, as explained in the next chapter. In our approach, the serializer must obtain for each object a portrayal -- a representation of that object in terms of live references to other objects. The serializer's traversal proceeds as it obtains a portrayal for each of these "components", etc. A serializer can simply ask an object to provide a portrayal of itself -- the object's self portrait -- or, as we will see, it can derive or substitute its own portrayal of the object as an expression of some other policy objective. However it obtains these portrayals, the resulting depiction is a literal picture only of the graph of these portrayals, rather than the actual graph of objects being serialized. Concrete Embodiment in EAlthough the ideas in this paper should be applicable to any object-capability language and many serialization formats, as previously mentioned, for concreteness, we present the implementation of these ideas in E as applied to Data-E. When an example is shown as an E command line session, like ? pragma.syntax("0.8") ? 2 + 3 # value: 5 then the example also doubles as an executable regression test. By updocing the page containing the examples, you can see whether the system behaves as shown by the example. If you installed E at, for example, "c:/Program Files/erights.org" and placed "c:/Program Files/erights.org/scripts" on your PATH, then in a directory containing the chapters of this paper, at a shell prompt you can type: $ updoc.e deconstructing.html directory-path/deconstructing.html:............................... (Though please see "Security Considerations" before running this or any other Updoc script.) Each of the dots is a test case that successfully passed, like the above "2 + 3". As you read, if you are curious about how a variation of an example would behave, make a copy of the page, edit appropriately, and updoc it. Or try the examples interactively at a rune command-line: $ rune ? 2 + 3 # value: 5 ? The rune program starts an E read-eval-print loop. The "?" is the E prompt. To exit rune, type the end-of-file character: Control-D or Control-Z depending on your shell. In order to have a common point of reference, this paper assumes a basic prior knowledge of Java. JOSS (Java's Object Serialization Streams) [ref JOSS] is occasionally used for comparison, so a prior knowledge of it will help, but is not required. We assume only that prior knowledge of E explained in the Ode ("Capability-based Financial Instruments") [ref Ode] and that explained in the next section on "E's URI Expressions". E syntactically resembles other C tradition object languages, such as Java, C++, C#, and Python. When the meaning of E code isn't covered by the Ode or the next section, and isn't obvious by analogy with Java, we will explain as we go. For more on E, please see [ref erights.org, Walnut]. E's URI ExpressionsAn unfamiliar bit of E syntax, needed to understand the examples in this paper, is the URI expression, written as a URI string between angle brackets. ? def f := <file:/foo/bar> # value: <file:c:/foo/bar> The protocol name to the left of the URI's colon (file here, but any identifier) is transformed (mangled) into the name of the variable (file__uriGetter) whose value is asked to retrieve the named resource. The characters between the colon and the close angle bracket (which must be legal characters for a URI body), becomes the literal resource name to be looked up ("/foo/bar"). So the above session is a shorthand for the equivalent: ? def f := file__uriGetter.get("/foo/bar") # value: <file:c:/foo/bar> Besides accessing the kind of resources normally accessed by URLs, the URI expression is also used as E's module import mechanism. The typical form of import in E is def name := <import:fully-qualified-name> where fully-qualified-name is the full name, including package prefix, of an E module or a safe Java class. (In order to make the extensive Java libraries available in E without sacrificing capability security, we must tame them -- determine what subset of their public interface is consistent with capability security principles. As part of this taming process, we declare certain Java classes to be "safe", and therefore generally importable.) Because fully qualified names can be long, they are often factored as follows: def <packgeName> := <import:package-prefix.*> ... def name := <packageName:rest-of-path> The package subtree rooted at "org.erights.e.elib" is provided as a built-in convenience, as if we had already executed def <elib> := <import:org.erights.e.elib.*> As a result, ? def deSubgraphKit := <elib:serial.deSubgraphKit> # value: <deSubgraphKit> is equivalent to ? def deSubgraphKit := > <import:org.erights.e.elib.serial.deSubgraphKit> # value: <deSubgraphKit> Such a package subtree root gives access only to that subtree of the original package tree. Similarly, as will be explained in Manipulating Authority at the Exits, a directory can be used as a uriGetter in order to give access only to a subtree of the file system, as retrieved by names relative to that directory. Previews of Data-E SerializationIn the Data-E System, we compose a serializer or unserializer from a pair of Data-E Kits, such as the deSubgraphKit already imported above. Each kit knows how to recognize and build a given kind of Data-E representation. The deSubgraphKit is special, in that the representation it traffics in is subgraphs of actual objects. All the other representations are depictions of subgraphs expressed in the Data-E "language". For example, the deSrcKit traffics in representations of Data-E as source strings, written in the Data-E subset of the E source language. In this document, to make clear when we're looking at Data-E rather than E source, all Data-E source strings are shown prefixed with "de: ". ? def deSrcKit := <elib:serial.deSrcKit> # value: <deSrcKit> To serialize is to recognize a subgraph and to build a depiction. ? "de: " + deSubgraphKit.recognize([false, 3], deSrcKit.makeBuilder()) # value: "de: def t__0 := [false, def t__2 := 3]" Each recognize(..) method takes two arguments -- the input representation to recognize (here, the subgraph to traverse), and a builder to call as parts of the representation are recognized, in order to build the output representation (here, a Data-E source string). Each kit provides a recognize(..) method for accepting its form of representation as input, and a makeBuilder() for making a builder to build its form of representation as output. The recognize(..) method returns its argument builder's final output. The deASTKit manages another depiction of Data-E programs: as Abstract Syntax Trees, as explained in Appendix A: The Data-E Manual. For present purposes, we care only about one feature of this kit, that it simplifies the expression a bit during building; for example, by removing unnecessary temporary variables. Interposing it between between the deSubgraphKit and the deSrcKit, we build our first expository serialize function, serialize_a. (All functions so named are for expository purposes only. See Appendix A for a guide to realistic usage.) ? def deASTKit := <elib:serial.deASTKit> # value: <deASTKit> ? def serialize_a(root) :String { > def ast := deSubgraphKit.recognize(root, deASTKit.makeBuilder()) > "de: " + deASTKit.recognize(ast, deSrcKit.makeBuilder()) > } # value: <serialize_a> ? serialize_a([false, 3]) # value: "de: [false, 3]" To unserialize is to recognize a depiction and to build a subgraph. The matching expository unserialize function follows. Parameters in E are actually patterns, of which the most common is a simple variable name, which defines the variable and binds it to the specimen (the argument passed in). Here we see a pattern for stripping off the additional prefix we added above. This pattern matches a string beginning with "de: ". If the specimen is such a string, then it matches the rest of the string against the pattern following the "@" -- in this case a simple variable name which is bound to the remainder of the string. ? def unserialize_a(`de: @src`) :any { > deSrcKit.recognize(src, deSubgraphKit.makeBuilder()) > } # value: <unserialize_a> ? unserialize_a("de: [false, 3]") # value: [false, 3] The result of evaluating the depiction-as-expression is a reconstruction of the original subgraph. The case shown above stays clear of all the problematic cases for serialization: All the objects in the graph being serialized here (a boolean, an integer, and a list) are transparent -- they willingly divulge all their state to their clients through their public protocol. All the objects participating in the above serialization -- the serialize function and the objects being depicted -- seek only literal accuracy; and likewise during the above unserialization. We call this unproblematic starting case literal realism for transparent subgraphs. A Proper FailureObjects defined in E are encapsulated by default. If we make an encapsulated object and try to serialize it: ? def capsule {} # value: <capsule> ? serialize_a(capsule) # problem: Can't uneval <capsule> we find our simple serialize function fails, as it must. Usually this failure will be the desired behavior. When it isn't, we have several alternatives. Unconditional Transparency
A serializer asks an object for its self portrait with the __optUncall() message. The corresponding default (Miranda) method returns null, indicating that no self portrait is offered. An object which overrides this to return a triple is unconditionally transparent -- it offers its self portrait to any client, just for the asking. The triple describes a call to be performed during unserialization to create the reconstruction of the original object. We start with a simple example: ? def iAmFive { > to __optUncall() :__Portrayal { > [2, "add", [3]] > } > } # value: <iAmFive> ? serialize_a(iAmFive) # value: "de: 2.add(3)"
The expression "2 + 3" is just a syntactic shorthand for "2.add(3)". ? unserialize_a("de: 2.add(3)") # value: 5 ? unserialize_a("de: 2 + 3") # value: 5 The object iAmFive is unconditionally transparent, but is not literally, or even usefully, realistic. When we reconstruct an object according to its self-portrait, the result can be quite different from the original. Named Exit PointsAnother approach to dealing with problematic objects is serialization avoidance by making references to these into named exit points. The traversal of a subgraph stops whenever it encounters such references, writing out named exit points instead (the jigsaw plugs shown on Figure 2: The Three Faces). We define matching serialize and unserialize functions customized to treat references to capsule as an exit point. To create a custom serialize function, we first create a custom unscope -- a table mapping from exit references to names. This is most conveniently done by modifying the default unscope table -- the one used implicitly in the previous examples. ? def unscope_b := deSubgraphKit.getDefaultUnscope().diverge() ? unscope_b[capsule] := "foo" # value: "foo" ? def recognizer_b := deSubgraphKit.makeRecognizer(null, unscope_b) # value: <unevaler> ? def serialize_b(root) :String { > def ast := recognizer_b.recognize(root, deASTKit.makeBuilder()) > "de: " + deASTKit.recognize(ast, deSrcKit.makeBuilder()) > } # value: <serialize_b> ? serialize_b([capsule, 3]) # value: "de: [foo, 3]" As we see, in the Data-E expression produced by serialization, a named exit reference becomes a free variable reference. Data-E unserialization is expression evaluation -- that subset of E expression evaluation applicable to the Data-E subset of E. The inverse of the unscope is therefore just the conventional notion of a scope (or environment), a mapping from variable names to values. On reconstruction, the named exit points will be reconnected to these values. ? def scope_b := deSubgraphKit.getDefaultScope().diverge() ? scope_b["foo"] := def newCapsule {} # value: <newCapsule> ? def unserialize_b(`de: @src`) :any { > # Just a way of saying eval(src, scope_b) > deSrcKit.recognize(src, deSubgraphKit.makeBuilder(scope_b)) > } # value: <unserialize_b> ? unserialize_b("de: [foo, 3]") # value: [<newCapsule>, 3] Each set of exit names together with a mutual understanding about what they may be bound to forms a unique micro-standard data format. Serializers and unserializers must agree on such a micro-standard, and so often come in matched pairs, as above. This agreement still leaves room for separate customization on each side, as with our unserializer's choice to bind a somewhat different object to the name "foo" in the scope, perhaps to adapt to a difference in the unserializer's context.
The Gordian SurgeonSince the serialize function, unserialize function, scope, unscope, and (as we will see) uncallers list are often manipulated together, as above, we introduce the Gordian Surgeon -- an object that knows how to manipulate and wield these five tools in a coordinated fashion to, in effect, cut a subgraph from a donor context, freeze it, thaw it, and transplant it into a recipient graph. The following session is like that above, except that the same capsule is used for serialization and unserialization -- as this is the common pattern the surgeon makes convenient. The addExit(..) below adds the value-name association to the unscope and adds the inverse name-value association to the scope. ? def makeSurgeon := <elib:serial.makeSurgeon> ? def surgeon := makeSurgeon.withSrcKit("de: ").diverge() ? surgeon.serialize([capsule, 3]) # problem: Can't uneval <capsule> ? surgeon.addExit(capsule, "foo") ? surgeon.serialize([capsule, 3]) # value: "de: [foo, 3]" ? surgeon.unserialize("de: [foo, 3]") # value: [<capsule>, 3] Counting "Generations"Our first realistic example uses both self-portraits (with unconditional transparency) and named exit points: ? def makeGenerationCounter(count :int) :any { > def generationCounter { > > /** > * Make my successor with the next larger count > */ > to __optUncall() :__Portrayal { > [makeGenerationCounter, "run", [count+1]] > } > > /** similar purpose as Java's .toString() */ > to __printOn(out :TextWriter) :void { > out.print(`<gen $count>`) > } > } > } # value: <makeGenerationCounter> ? def genCounter := makeGenerationCounter(0) # value: <gen 0> A generationCounter is an object with a single instance variable, count. The makeGenerationCounter function acts like a constructor -- it makes new generationCounter instances. Above, we make the genCounter instance with a count of zero. Each generationCounter uses the underlined code above to "misrepresent" itself as something that would be made by calling makeGenerationCounter with a count value one greater than its own. However, this portrayal enables us to serialize a generationCounter only if we can serialize makeGenerationCounter, which is just as encapsulated as our earlier capsule. Since it is stateless, it is plausible to "serialize" it by not serializing it -- by making it into a named exit point. ? surgeon.addExit(makeGenerationCounter, "makeGenerationCounter") ? surgeon.serialize(genCounter) # value: "de: makeGenerationCounter(1)" ? surgeon.unserialize("de: makeGenerationCounter(1)") # value: <gen 1> A reconstructed generationCounter has a count one greater than its original, thereby accumulating a count of the number of serialize / unserialize cycles it has been through since it was born. Rather than seeing the underlined "misrepresentation" as a problem to prohibit, this example shows how this representational freedom is an opportunity. Unserialization as Evaluation(This is approximately an abridged presentation of the Unserialization as Evaluation section of Appendix A: The Data-E Manual.) As shown above, unserialization can be thought of, or even implemented as, expression evaluation [ref Rees, XMLEncoder]. A depiction is an expression in some programming language, the unserializer is the eval function, the exit references to be reconnected are free variable references, the values to reconnect them to come from the scope (i.e., environment) provided to eval, and the root of the reconstructed subgraph is the value the expression evaluates to. Serialization is the logically inverse process, in which an uneval function is applied to a root and an unscope, and writes out an expression that, were it evaluated in the corresponding scope, would reconstruct the subgraph. Data-E is the subset of E used for depicting a subgraph as an expression. Ignoring precedence, it consists of the following productions: Fundamental Data-E Constructs
E Syntactic Shorthands generated by deSrcKitSince we use deSrcKit to build the depictions we present for expository purposes, we need to know the shorthands it builds, which are a subset of the shorthands recognized and expanded by E. Going the other way, all E syntactic shorthands, including those below, are recognized by deSrcKit, since it uses the E parser to parse and expand its input. Any valid E expression that expands only into the above Data-E primitives is a valid Data-E expression with the same meaning. Likewise any valid Data-E expression is a valid E expression with the same meaning.
Using several cases together: The depiction is shown in the middle following the "de: ". It is written in Data-E and has the following meaning:? def root := [1, root, 1, <import:java.lang.makeStringBuffer>] # value: [1, <***CYCLE***>, 1, <makeStringBuffer>] ? def depiction := surgeon.serialize(root) # value: "de: def t__0 := [def t__2 := 1, # t__0, # t__2, # <import:java.lang.makeStringBuffer>]" ? surgeon.unserialize(depiction) # value: [1, <***CYCLE***>, 1, <makeStringBuffer>]
For those familiar with Java, Data-E should be mostly familiar, but with a few important differences:
By using promises to reconstruct cycles, we safely avoid a host of hazards. Most other serialization systems [ref JOSS, XMLEncoder, BOSS, ...] are defined in systems without any kind of delayed references (promises, logic variables, futures), but which still allow user-defined unserialize-time behavior by the objects being unserialized. In any such system, when unserializing a cycle, user-defined behavior may interact with graph neighbors that are not yet fully initialized. Before an object is fully initialized, it may be easily confused. In a conventional system this is only a minor source of bugs. But in a graph of mutually suspicious objects this would be a major opportunity for an adversary. By referring to an object only with promises until its fully initialized, we make such bugs safely fail-stop, enabling an adversary in the graph only to mount a denial-of-unserialization attack, which it can trivially do anyway, and which we make no effort or claim to prevent. Data-E is a true subset of E, both syntactically and semantically. Since E is a secure object-capability language, the security issues surrounding evaluation of E programs are already well understood. By using a subset of E in this way, we get to leverage this understanding. |
|||||||||||||||||||||||||||||||||||||
Unless stated otherwise, all text on this page which is either unattributed or by Mark S. Miller is hereby placed in the public domain.
|