Preserves: a tutorial

By Christine Lemmer Webber and Tony Garnock-Jones
August 2019.

This document, like Preserves itself, is released under version 2.0 of the Apache license.

Overview

Preserves is a serialization system which supplies both a human-readable textual syntax and an efficient binary syntax; converting between the two is straightforward. Preserves’ human-readable syntax is easy to read and should be mostly familiar if you already know systems like JSON. However, Preserves is more precisely specified than JSON, and also has a clean extension mechanism.

This document is a tutorial; it does not get into all the details of Preserves. For that, see the Preserves specification.

Preserves basics

Starting with the familiar

If you’re familiar with JSON, Preserves looks fairly similar:

    {"name": "Missy Rose",
     "species": "Felis Catus",
     "age": 13,
     "foods": ["kibble", "cat treats", "tinned meat"]}

Preserves also has something we can use for debugging/development information called “annotations”; they aren’t actually read in as data but we can use them for comments. (They can also be used for other development tools and are not restricted to strings; more on this later, but for now, we will stick to the special comment annotation syntax.)

    # I'm an annotation... basically a comment. Ignore me!
    "I'm data! Don't ignore me!"

Preserves supports some data types you’re probably already familiar with from JSON, and which look fairly similar in the textual format:

    # booleans
    #t
    #f

    # various kinds of numbers:
    42
    123556789012345678901234567890
    -10
    13.5

    # strings
    "I'm feeling stringy!"

    # sequences (lists)
    ["cat", "dog", "mouse", "goldfish"]

    # dictionaries (hashmaps)
    {"cat": "meow",
     "dog": "woof",
     "goldfish": "glub glub",
     "mouse": "squeak"}

Going beyond JSON

We can observe a few differences from JSON already: it’s possible to reliably express integers of arbitrary size in Preserves, and booleans look a little bit different. A few more interesting differences:

    # Preserves treats commas as whitespace, so these are the same
    ["cat", "dog", "mouse", "goldfish"]
    ["cat" "dog" "mouse" "goldfish"]

    # We can use anything as keys in dictionaries, not just strings
    {1: "the loneliest number",
     ["why", "was", 6, "afraid", "of", 7]: "because 7 8 9",
     {"dictionaries": "as keys???"}: "well, why not?"}
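
Compound keys need a little care in a host language whose dictionaries require hashable keys. Here is a minimal Python sketch of the dictionary above, assuming we model sequences as tuples and dictionaries as frozensets of pairs. The helper name `freeze` is hypothetical and not part of any Preserves library.

```python
def freeze(value):
    """Recursively convert lists and dicts into hashable equivalents."""
    if isinstance(value, list):
        return tuple(freeze(item) for item in value)
    if isinstance(value, dict):
        return frozenset((freeze(k), freeze(v)) for k, v in value.items())
    return value

# The three keys from the example: an integer, a sequence, and a dictionary.
table = {
    freeze(1): "the loneliest number",
    freeze(["why", "was", 6, "afraid", "of", 7]): "because 7 8 9",
    freeze({"dictionaries": "as keys???"}): "well, why not?",
}

# Lookups succeed with any structurally equal key.
answer = table[freeze(["why", "was", 6, "afraid", "of", 7])]
```

A real implementation may well choose a different representation; the point is only that keys are compared by structure.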

Preserves technically provides various types of numbers:

    # Signed Integers
    42
    -42
    5907212309572059846509324862304968273468909473609826340
    -5907212309572059846509324862304968273468909473609826340

    # Doubles (Double-precision IEEE floats)
    3.141592653589793
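
To see why the distinction between integers and doubles matters, here is a small Python illustration. It uses only Python’s built-in arbitrary-precision `int` and its `float` (an IEEE double), not a Preserves library. Forcing the large integer above through a double, as happens in JSON parsers that store every number as a double, silently changes its value.

```python
big = 5907212309572059846509324862304968273468909473609826340

# A double carries 53 bits of precision; this integer needs far more.
as_double = float(big)
round_tripped = int(as_double)

print(big.bit_length() > 53)   # True: too wide for a double's mantissa
print(round_tripped == big)    # False: precision was lost on the way
```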

Preserves also provides some types that don’t come in JSON. Symbols are fairly interesting; they look a lot like strings but really aren’t meant to represent text as much as they are, well… a symbolic name. Often they’re meant to be used for something that has symbolic importance to the program, but not textual importance (other than to guide the programmer… not unlike variable names).

    # A symbol (NOT a string!)
    JustASymbol

    # You can do mixedCase or CamelCase too of course, pick your poison
    # (but be consistent, for the sake of your collaborators!)
    iAmASymbol
    i-am-a-symbol

    # A list of symbols
    [GET, PUT, POST, DELETE]

    # A symbol with spaces in it
    |this is just one symbol believe it or not|
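
One way to picture the difference between symbols and strings is a sketch in Python, which has no built-in symbol type. The `Symbol` class below is an assumption made for illustration; actual Preserves libraries may represent symbols differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Symbol:
    """A symbolic name; deliberately not a subclass of str."""
    name: str

GET = Symbol("GET")

# A symbol never compares equal to the string with the same spelling...
assert GET != "GET"
# ...but two symbols with the same name are the same symbol.
assert GET == Symbol("GET")

# A symbol containing spaces is still a single value.
odd = Symbol("this is just one symbol believe it or not")
```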

We can also add binary data, aka ByteStrings:

    # Some binary data, base64 encoded
    #[cGljdHVyZSBvZiBhIGNhdA==]

    # Some other binary data, hexadecimal encoded
    #x"616263"

    # Same binary data as above, base64 encoded
    #[YWJj]

What’s neat about this is that we don’t have to “pay the cost” of base64 or hexadecimal encoding when we serialize this data to binary; the length of the binary data is the length of the binary data.
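
The textual spellings above can be checked with Python’s standard library. This is only a sketch of what the two encodings denote; it does not parse Preserves syntax.

```python
import base64

from_base64 = base64.b64decode("YWJj")   # the payload of #[YWJj]
from_hex = bytes.fromhex("616263")       # the payload of #x"616263"

# Both textual spellings denote the same three bytes.
assert from_base64 == from_hex == b"abc"

# The first example decodes to a 16-byte description.
picture = base64.b64decode("cGljdHVyZSBvZiBhIGNhdA==")

# Text encodings inflate the payload; the raw bytes do not.
print(len("YWJj"), len("616263"), len(from_base64))  # 4 6 3
```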

Conveniently, Preserves also includes Sets, which are collections of unique elements where ordering of items is unimportant.

    #{flour, salt, water}

Canonicalization

This is a good time to mention canonicalization. From a semantic perspective, sets and dictionaries do not carry information about the ordering of their elements, and Preserves doesn’t care what order we enter them in for our hand-written-as-text Preserves documents. Even so, Preserves provides support for canonical ordering when serializing.

In canonicalizing output mode, Preserves will always write out a given value using exactly the same bytes, every time. This is important and useful for many contexts, but especially for cryptographic signatures and hashing.
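
The following Python sketch shows why byte-for-byte identical output matters for hashing. It assumes a toy encoding invented for this illustration (`toy_encode` is hypothetical and is NOT the real Preserves binary format), uses plain strings where this section’s example uses symbols, and orders set elements and dictionary entries by their encoded bytes.

```python
import hashlib

def toy_encode(value) -> bytes:
    """Toy canonical encoder (NOT the real Preserves binary format)."""
    if isinstance(value, str):
        data = value.encode("utf-8")
        return b"s" + str(len(data)).encode("ascii") + b":" + data
    if isinstance(value, (set, frozenset)):
        # Order elements by their encoded bytes, not by insertion order.
        items = sorted(toy_encode(item) for item in value)
        return b"{" + b"".join(items) + b"}"
    if isinstance(value, dict):
        # Order entries by the encoded bytes of their keys.
        entries = sorted((toy_encode(k), toy_encode(v)) for k, v in value.items())
        return b"<" + b"".join(k + v for k, v in entries) + b">"
    raise TypeError(f"unsupported value: {value!r}")

# The same animals data, typed in two different orders.
hand_typed = {
    "monkey": {"noise": "ooh-ooh", "eats": {"bananas", "berries"}},
    "cat": {"noise": "meow", "eats": {"kibble", "cat treats", "tinned meat"}},
}
reordered = {
    "cat": {"eats": {"tinned meat", "cat treats", "kibble"}, "noise": "meow"},
    "monkey": {"eats": {"berries", "bananas"}, "noise": "ooh-ooh"},
}

digest_a = hashlib.sha256(toy_encode(hand_typed)).hexdigest()
digest_b = hashlib.sha256(toy_encode(reordered)).hexdigest()
assert digest_a == digest_b  # same value, same bytes, same hash
```

Without the sorting steps, the two digests could differ even though the values are equal, and a signature made over one encoding would fail to verify against the other.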

    # This hand-typed Preserves document...
    {monkey: {"noise": "ooh-ooh",
              "eats": #{"bananas", "berries"}}
     cat: {"noise": "meow",
           "eats": #{"kibble", "cat treats", "tinned meat"}}}

    # Will always, always be written out in this order (except in
    # binary, of course) when canonicalized:
    {cat: {"eats": #{"cat treats", "kibble", "tinned meat"},
           "noise": "meow"}
     monkey: {"eats": #{"bananas", "berries"},
              "noise": "ooh-ooh"}}

Defining our own types using Records

Finally, there is one more type that Preserves provides… but in a sense, it’s a meta-type. Record objects have a label and a series of arguments (or “fields”). For example, we can make a Date record:

    <Date 2019 8 15>

In this example, the Date label is a symbol; 2019, 8, and 15 are the year, month, and day fields respectively.

Why do we care about this? We could instead just decide to encode our date data in a string, like “2019-08-15”. A document using such a date structure might look like so:

    {"name": "Gregor Samsa",
     "description": "humanoid trapped in an insect body",
     "born": "1915-10-04"}

Unfortunately, say our boss comes along and tells us that the people doing data entry have complained that it isn’t always possible to get an exact date. They would like to be able to type in what they know if they don’t know the date exactly.

This causes a problem. Now we might have two kinds of entries:

    # Exact date known
    {"name": "Gregor Samsa",
     "description": "humanoid trapped in an insect body",
     "born": "1915-10-04"}

    # Not sure about exact date...
    {"name": "Gregor Samsa",
     "description": "humanoid trapped in an insect body",
     "born": "Sometime in October 1915?  Or was that when he became an insect?"}

This is a mess. We could try matching each value against a regular expression to see if it “looks like a date”, but doing this kind of thing is prone to errors and weird edge cases. No, it’s better to be able to have a separate type:

    # Exact date known
    {"name": "Gregor Samsa",
     "description": "humanoid trapped in an insect body",
     "born": <Date 1915 10 04>}

    # Not sure about exact date...
    {"name": "Gregor Samsa",
     "description": "humanoid trapped in an insect body",
     "born": <Unknown "Sometime in October 1915?  Or was that when he became an insect?">}

Now we can distinguish the two.
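
On the reading side, a program can then dispatch on the record type rather than guess from the shape of a string. Here is a minimal Python sketch, assuming the two records have been mapped onto hypothetical `Date` and `Unknown` classes; no Preserves library is used, and the function name `describe_birth` is made up for illustration.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Date:
    year: int
    month: int
    day: int

@dataclass(frozen=True)
class Unknown:
    note: str

Born = Union[Date, Unknown]

def describe_birth(born: Born) -> str:
    """Dispatch on the record type instead of guessing from a string."""
    if isinstance(born, Date):
        return f"born on {born.year:04d}-{born.month:02d}-{born.day:02d}"
    if isinstance(born, Unknown):
        return f"birth date uncertain: {born.note}"
    raise TypeError(f"unexpected record: {born!r}")

exact = describe_birth(Date(1915, 10, 4))
vague = describe_birth(Unknown("Sometime in October 1915?"))
```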

We can make as many Record types as our program needs, though it is up to our program to make sense of what these mean. Since Preserves does not itself specify what a Date is, both the program (or person) writing the Preserves document and the program reading it need to have a mutual understanding of how many fields the record has and what its label signifies for it to be of use.

Still, there are plenty of interesting labels we can define. Here is one for an “iri”, a hyperlink:

    <iri "https://dustycloud.org/blog/">

That’s nice enough, but here’s another interesting detail… labels on Records are usually symbols but aren’t necessarily so. They can also be strings or numbers or even dictionaries. And very interestingly, they can also be other records:

    < <iri "https://www.w3.org/ns/activitystreams#Note">
      {"to": [<iri "https://chatty.example/ben/">],
       "attributedTo": <iri "https://social.example/alyssa/">,
       "content": "Say, did you finish reading that book I lent you?"} >

Do you see it? This Record’s label is… an iri Record! The link here points to a more precise term saying that “this is a note meant to be sent around in social networks”. It is considerably more precise than just using the string or symbol “Note”, which could be ambiguous. (A social networking note? A footnote? A music note?) While not all systems need this, this (partial) example hints at how Preserves can also be used to coordinate meaning in larger, more decentralized systems.
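
To make the nesting concrete, here is a Python sketch of a generic record whose label can be any value, including another record. The `Record` type and the `iri` helper are assumptions for illustration, and the plain string "iri" stands in for the symbol `iri`.

```python
from typing import Any, NamedTuple, Tuple

class Record(NamedTuple):
    label: Any               # usually a symbol, but any value is allowed
    fields: Tuple[Any, ...]

def iri(url: str) -> Record:
    """Build an <iri ...> record; the string "iri" stands in for a symbol."""
    return Record("iri", (url,))

note = Record(
    iri("https://www.w3.org/ns/activitystreams#Note"),
    ({
        "to": [iri("https://chatty.example/ben/")],
        "attributedTo": iri("https://social.example/alyssa/"),
        "content": "Say, did you finish reading that book I lent you?",
    },),
)

# The label is itself a record, so it can be inspected like one.
assert note.label.label == "iri"
assert note.label.fields == ("https://www.w3.org/ns/activitystreams#Note",)
```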

Likewise, it is also possible to label records with integers. Languages like OCaml use integers instead of symbolic record labels because their type systems ensure that it is never ambiguous what, say, the label 23 means in any given context. Allowing integer record labels lets Preserves directly express OCaml data.

Annotations

Annotations are not strictly a necessary feature, but they are useful in some circumstances. We have previously shown them used as comments:

    # I'm a comment!
    "I am not a comment, I am data!"

Annotations annotate the values they precede. It is possible to have multiple annotations on a value. The hash-space (or hash-tab) comment syntax is syntactic sugar for the general @-prefixed string annotation syntax.

    # I am annotating this number
    @"And so am I!"
    42

As said, annotations are not really data. They are merely meant for development tooling or debugging. You have to explicitly ask for them when reading; when you do, every value that is read comes wrapped together with its annotations. Many implementations will, in the same mode, also supply line number and column information attached to each read value.
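
One way to picture this wrapping is the following Python sketch. The `Annotated` class and the `strip_annotations` helper are hypothetical names chosen for illustration; real Preserves libraries define their own types for this.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Annotated:
    """A value plus the annotations that preceded it in the text."""
    value: Any
    annotations: List[Any] = field(default_factory=list)

def strip_annotations(item: Any) -> Any:
    """Recover the plain data, discarding annotations at every level."""
    if isinstance(item, Annotated):
        return strip_annotations(item.value)
    if isinstance(item, list):
        return [strip_annotations(element) for element in item]
    if isinstance(item, dict):
        return {key: strip_annotations(val) for key, val in item.items()}
    return item

# The doubly annotated 42 from the example above.
answer = Annotated(42, ["I am annotating this number", "And so am I!"])

assert strip_annotations(answer) == 42
assert strip_annotations([answer, "plain"]) == [42, "plain"]
```

A reader that was not asked for annotations would hand back the plain 42 directly, which is why annotations never affect the data a program sees by default.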

So what’s the point of them then? If annotations were just for comments, there would indeed be hardly any point at all… it would be simpler to just provide a comment syntax.

However, annotations can be used for more than just comments. They can also be used for debugging or other development-tool-oriented data.

For instance, here’s a reply from an HTTP API service running in “debug” mode annotated with the time it took to produce the reply and the internal name of the server that produced the response:

    @<ResponseTime <Milliseconds 64.4>>
    @<BackendServer "humpty-dumpty.example.com">
    <Success
      <Employees [
        <Employee "Alyssa P. Hacker"
                  #{<Role Programmer>, <Role Manager>}
                  <Date 2018, 1, 24>>
        <Employee "Ben Bitdiddle"
                  #{<Role Programmer>}
                  <Date 2019, 2, 13>> ]>>

The annotations aren’t related to the data requested, which is all about “employees”; instead, they’re about the systems that produced the response. You could say they’re in the domain of “debugging” instead of the domain of “employees”.

Conclusions

We’ve covered the broad strokes of Preserves, but not everything that is possible with it. We leave it as an exercise to the reader to try reading these examples in their language of choice (several libraries exist already) and writing them out as binary objects.

But as we’ve seen, Preserves is a flexible system which comes with well-defined, carefully specified built-in types, as well as a meta-type which can be used as an extension point.

Happy preserving!