Preserves: an Expressive Data Language

Tony Garnock-Jones tonyg@leastfixedpoint.com
June 2022. Version 0.6.3.

Preserves is a data model, with associated serialization formats.

It supports records with user-defined labels, embedded references, and the usual suite of atomic and compound data types, including binary data as a distinct type from text strings. Its annotations allow separation of data from metadata such as comments, trace information, and provenance information.

Preserves departs from many other data languages in defining how to compare two values. Comparison is based on the data model, not on syntax or on data structures of any particular implementation language.

This document defines the core semantics and data model of Preserves and presents a handful of examples. Two other core documents define a human-readable text syntax and a machine-oriented binary encoding for the Preserves data model.

Values

Preserves values are given meaning independent of their syntax. We will write “Value” when we mean the set of all Preserves values or an element of that set.

Values fall into two broad categories: atomic and compound data. Every Value is finite and non-cyclic. Embedded values, called Embeddeds, are a third, special-case category.

                      Value = Atom
                            | Compound
                            | Embedded

                       Atom = Boolean
                            | Float
                            | Double
                            | SignedInteger
                            | String
                            | ByteString
                            | Symbol

                   Compound = Record
                            | Sequence
                            | Set
                            | Dictionary

Total order. As we go, we will incrementally specify a total order over Values. Two values of the same kind are compared using kind-specific rules. The ordering among values of different kinds is essentially arbitrary, but having a total order is convenient for many tasks, so we define it as follows:

        (Values)        Atom < Compound < Embedded

        (Compounds)     Record < Sequence < Set < Dictionary

        (Atoms)         Boolean < Float < Double < SignedInteger
                          < String < ByteString < Symbol

Equivalence. Two Values are equal if neither is less than the other according to the total order.
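The cross-kind ordering and the derived equivalence can be sketched in Python. This is a minimal illustration, not part of the specification: the Symbol class, the mapping from Python types to Preserves kinds, and the names kind_rank, atom_key and equivalent are all assumptions made for the example. Float is left out because Python has no built-in single-precision type, and Doubles fall back on Python's native comparison rather than the totalOrder predicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """Hypothetical stand-in for a Preserves Symbol."""
    name: str


# Atom kinds in the order listed above, lowest first (Float omitted).
# Assumed mapping: bool=Boolean, float=Double, int=SignedInteger,
# str=String, bytes=ByteString.
ATOM_KINDS = [bool, float, int, str, bytes, Symbol]


def kind_rank(value):
    """Position of an atom's kind in the cross-kind order."""
    for rank, kind in enumerate(ATOM_KINDS):
        # Exact type match: bool is a subclass of int in Python, so
        # isinstance() would misclassify True as a SignedInteger.
        if type(value) is kind:
            return rank
    raise TypeError(f"not a supported atom: {value!r}")


def atom_key(value):
    """Sort key: kind first, then the kind-specific ordering."""
    payload = value.name if isinstance(value, Symbol) else value
    return (kind_rank(value), payload)


def equivalent(a, b):
    """Equal when neither value is less than the other."""
    return not atom_key(a) < atom_key(b) and not atom_key(b) < atom_key(a)


mixed = [Symbol("3"), "3", 3, 3.0, True]
ordered = sorted(mixed, key=atom_key)
print([type(v).__name__ for v in ordered])
print(equivalent(3, 3.0))  # differs from Python's own 3 == 3.0
```

Under this sketch the Double 3.0 and the SignedInteger 3 are distinct Values, even though Python itself treats them as equal.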

Signed integers.

A SignedInteger is an arbitrarily-large signed integer. SignedIntegers are compared as mathematical integers.

Unicode strings.

A String is a sequence of Unicode code-points.[1] Strings are compared lexicographically, code-point by code-point.[2]
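As a small illustration of code-point ordering, the following Python sketch sorts a few strings three ways. Python's built-in str comparison already works code-point by code-point, so it serves as the reference here. The sample strings are arbitrary choices for the example.

```python
# Arbitrary sample strings, including code points beyond the ASCII range.
words = ["caa", "c", "bzz", "\u00e9", "z", "\U0001F600", "\uFFFF"]

# Python compares str values code-point by code-point.
by_code_point = sorted(words)

# Byte-wise comparison of the UTF-8 encodings gives the same order.
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))

# UTF-16 code units do not: U+1F600 is encoded with surrogates
# (D83D DE00), which sort below the single unit FFFF.
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8)
print(by_code_point == by_utf16)
```

The UTF-16 case is a reminder that implementations whose native strings are sequences of UTF-16 code units cannot rely on the host language's default string comparison.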

Binary data.

A ByteString is a sequence of octets. ByteStrings are compared lexicographically.

Symbols.

Programming languages like Lisp and Prolog frequently use string-like values called symbols. Here, a Symbol is, like a String, a sequence of Unicode code-points representing an identifier of some kind. Symbols are also compared lexicographically by code-point.

Booleans.

There are two Booleans, “false” and “true”. The “false” value is less than the “true” value.

IEEE floating-point values.

Floats and Doubles are single- and double-precision IEEE 754 floating-point values, respectively. Floats, Doubles and SignedIntegers are disjoint; by the rules above, every Float is less than every Double, and every SignedInteger is greater than both. Two Floats or two Doubles are to be ordered by the totalOrder predicate defined in section 5.10 of IEEE Std 754-2008.
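One common way to obtain the totalOrder ordering for Doubles is to reinterpret the bit pattern as an unsigned integer and adjust it so that plain integer comparison gives the desired result. The Python sketch below takes that approach; the function name total_order_key is an assumption for the example, and the same idea would apply to Floats with 32-bit patterns.

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Map a double to an integer whose natural order follows totalOrder."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits >> 63:
        # Sign bit set: larger magnitudes must sort lower, so invert all bits.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set the top bit so these sort above all negatives.
    return bits | 0x8000000000000000


values = [math.nan, 1.0, 0.0, -0.0, math.inf, -1.0, -math.inf]
ordered = sorted(values, key=total_order_key)
print([repr(v) for v in ordered])
```

Unlike the usual numeric comparison, this ordering distinguishes -0.0 from 0.0 and gives every NaN a definite position.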

Records.

A Record is a labelled tuple of Values, the record’s fields. A label can be any Value, but is usually a Symbol.[3][4] Records are compared lexicographically: first by label, then by field sequence.

Sequences.

A Sequence is a sequence of Values. Sequences are compared lexicographically.

Sets.

A Set is an unordered finite set of Values. It contains no duplicate values, following the equivalence relation induced by the total order on Values. Two Sets are compared by sorting their elements ascending using the total order and comparing the resulting Sequences.

Dictionaries.

A Dictionary is an unordered finite collection of pairs of Values. Each pair comprises a key and a value. Keys in a Dictionary are pairwise distinct. Instances of Dictionary are compared by lexicographic comparison of the sequences resulting from ordering each Dictionary’s pairs in ascending order by key.
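The comparison rules for the four compound kinds can be sketched as a single recursive function. This is an illustrative Python sketch under several assumptions: only SignedInteger and String atoms are supported, Python's list, frozenset and dict stand in for Sequence, Set and Dictionary, the Record class is hypothetical, and each Dictionary pair is compared key first, then value.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a Preserves Record."""
    label: object
    fields: tuple


# Cross-kind order restricted to the kinds this sketch supports:
# SignedInteger < String < Record < Sequence < Set < Dictionary.
KINDS = [int, str, Record, list, frozenset, dict]


def compare_seq(xs, ys):
    """Lexicographic comparison; a proper prefix sorts first."""
    for x, y in zip(xs, ys):
        c = compare(x, y)
        if c:
            return c
    return (len(xs) > len(ys)) - (len(xs) < len(ys))


def sort_values(values):
    """Sort ascending by the total order defined by compare()."""
    return sorted(values, key=cmp_to_key(compare))


def compare(a, b):
    """Return -1, 0 or 1 according to the total order."""
    ra, rb = KINDS.index(type(a)), KINDS.index(type(b))
    if ra != rb:
        return -1 if ra < rb else 1
    if isinstance(a, (int, str)):
        return (a > b) - (a < b)
    if isinstance(a, Record):
        return compare(a.label, b.label) or compare_seq(a.fields, b.fields)
    if isinstance(a, list):
        return compare_seq(a, b)
    if isinstance(a, frozenset):
        return compare_seq(sort_values(a), sort_values(b))
    # Dictionary: order the pairs by key, then compare the pair sequences.
    # Assumption: each pair is compared key first, then value.
    pairs_a = [[k, a[k]] for k in sort_values(a)]
    pairs_b = [[k, b[k]] for k in sort_values(b)]
    return compare_seq(pairs_a, pairs_b)


print(compare([1, 2], [1, 2, 3]))
print(compare(frozenset({3, 1}), frozenset({2, 1})))
print(compare({"a": 1}, {"a": 2}))
```

Because Sets and Dictionaries are sorted before comparison, the result does not depend on the iteration order of the host language's collections.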

Embeddeds.

An Embedded allows inclusion of domain-specific, potentially stateful or located data into a Value.[5] Embeddeds may be used to denote stateful objects, network services, object capabilities, file descriptors, Unix processes, or other possibly-stateful things. Because each Embedded is a domain-specific datum, comparison of two Embeddeds is done according to domain-specific rules.

Examples. In a Java or Python implementation, an Embedded may denote a reference to a Java or Python object; comparison would be done via the language’s own rules for equivalence and ordering. In a Unix application, an Embedded may denote an open file descriptor or a process ID. In an HTTP-based application, each Embedded might be a URL, compared according to RFC 6943. When a Value is serialized for storage or transfer, Embeddeds will usually be represented as ordinary Values, in which case the ordinary rules for comparing Values will apply.

Examples

The definitions above are independent of any particular concrete syntax. The examples of Values that follow are written using the Preserves text syntax, and the example encoded byte sequences use the Preserves binary encoding.

Ordering.

The total ordering specified above means that the following statements are true:

"bzz" < "c" < "caa" < #!"a"
#t < 3.0f < 3.0 < 3 < "3" < |3| < [] < #!#t

Simple examples.

Value                        Encoded byte sequence
<capture <discard>>          B4 B3 07 ‘c’ ‘a’ ‘p’ ‘t’ ‘u’ ‘r’ ‘e’ B4 B3 07 ‘d’ ‘i’ ‘s’ ‘c’ ‘a’ ‘r’ ‘d’ 84 84
[1 2 3 4]                    B5 91 92 93 94 84
[-2 -1 0 1]                  B5 9E 9F 90 91 84
"hello" (format B)           B1 05 ‘h’ ‘e’ ‘l’ ‘l’ ‘o’
["a" b #"c" [] #{} #t #f]    B5 B1 01 ‘a’ B3 01 ‘b’ B2 01 ‘c’ B5 84 B6 84 81 80 84
-257                         A1 FE FF
-1                           9F
0                            90
1                            91
255                          A1 00 FF
1.0f                         82 3F 80 00 00
1.0                          83 3F F0 00 00 00 00 00 00
-1.202e300                   83 FE 3C B7 B7 59 BF 04 26
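Several of the rows above can be reproduced with a short encoder. The Python sketch below is inferred from the byte sequences shown in this document, not taken from the binary encoding specification, which remains authoritative. The Symbol and Record classes and the function names are hypothetical. Floats, Sets, Dictionaries, Embeddeds, payloads of 128 bytes or more, and integers wider than 16 bytes are deliberately left out.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """Hypothetical stand-in for a Preserves Symbol."""
    name: str


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a Preserves Record."""
    label: object
    fields: tuple


END = b"\x84"  # closes a Record or a Sequence in the examples above


def _counted(tag, payload):
    """Tag byte, one-byte length, then the payload."""
    if len(payload) >= 128:
        raise NotImplementedError("longer length prefixes are out of scope")
    return bytes([tag, len(payload)]) + payload


def encode_int(n):
    if -3 <= n <= 12:
        # Small integers fit in the low nibble of a 9x tag byte.
        return bytes([0x90 + (n & 0x0F)])
    # Shortest big-endian two's-complement representation.
    magnitude = n if n >= 0 else ~n
    length = (magnitude.bit_length() + 8) // 8
    if length > 16:
        raise NotImplementedError("very large integers are out of scope")
    return bytes([0xA0 + length - 1]) + n.to_bytes(length, "big", signed=True)


def encode(v):
    if v is False:
        return b"\x80"
    if v is True:
        return b"\x81"
    if isinstance(v, float):  # treated as a Double
        return b"\x83" + struct.pack(">d", v)
    if isinstance(v, int):
        return encode_int(v)
    if isinstance(v, str):
        return _counted(0xB1, v.encode("utf-8"))
    if isinstance(v, bytes):
        return _counted(0xB2, v)
    if isinstance(v, Symbol):
        return _counted(0xB3, v.name.encode("utf-8"))
    if isinstance(v, Record):
        body = encode(v.label) + b"".join(encode(f) for f in v.fields)
        return b"\xB4" + body + END
    if isinstance(v, list):
        return b"\xB5" + b"".join(encode(x) for x in v) + END
    raise TypeError(f"unsupported value: {v!r}")


print(encode([-2, -1, 0, 1]).hex(" "))
print(encode(-257).hex(" "))
print(encode(Record(Symbol("capture"), (Record(Symbol("discard"), ()),))).hex(" "))
```

Comparing the printed hexadecimal with the corresponding rows is a quick way to check an implementation against the examples.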

The next example uses a non-Symbol label for a record.[6] The Record

<[titled person 2 thing 1] 101 "Blackwell" <date 1821 2 3> "Dr">

encodes to

B4                                ;; Record
  B5                                ;; Sequence
    B3 06 74 69 74 6C 65 64           ;; Symbol, "titled"
    B3 06 70 65 72 73 6F 6E           ;; Symbol, "person"
    92                                ;; SignedInteger, "2"
    B3 05 74 68 69 6E 67              ;; Symbol, "thing"
    91                                ;; SignedInteger, "1"
  84                                ;; End (sequence)
  A0 65                             ;; SignedInteger, "101"
  B1 09 42 6C 61 63 6B 77 65 6C 6C  ;; String, "Blackwell"
  B4                                ;; Record
    B3 04 64 61 74 65                 ;; Symbol, "date"
    A1 07 1D                          ;; SignedInteger, "1821"
    92                                ;; SignedInteger, "2"
    93                                ;; SignedInteger, "3"
  84                                ;; End (record)
  B1 02 44 72                       ;; String, "Dr"
84                                ;; End (record)

JSON examples.

Preserves text syntax is a superset of JSON, so the examples from RFC 8259 read as valid Preserves.

The JSON literals true, false and null all read as Symbols, and JSON numbers read (unambiguously) either as SignedIntegers or as Doubles.[7]
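The reading rule for JSON literals can be mimicked on top of Python's json module. This is a rough sketch rather than a Preserves text parser: the Symbol class and the function names are hypothetical, and it assumes that Python's split between int and float for JSON numbers matches the split between SignedInteger and Double.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """Hypothetical stand-in for a Preserves Symbol."""
    name: str


def from_json(node):
    """Convert a parsed JSON tree into Preserves-like Python values."""
    if node is True:
        return Symbol("true")
    if node is False:
        return Symbol("false")
    if node is None:
        return Symbol("null")
    if isinstance(node, list):
        return [from_json(x) for x in node]
    if isinstance(node, dict):
        return {key: from_json(value) for key, value in node.items()}
    # int stands for SignedInteger, float for Double, str for String.
    return node


def read_json_as_preserves(text):
    return from_json(json.loads(text))


value = read_json_as_preserves(
    '{"Animated": false, "Width": 800, "Latitude": 37.7668, "Note": null}'
)
print(value)
```

The JSON literal false becomes the Symbol false rather than a Boolean, which matches the B3 05 "false" bytes in the first encoded example that follows.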

The first RFC 8259 example:

{
  "Image": {
      "Width":  800,
      "Height": 600,
      "Title":  "View from 15th Floor",
      "Thumbnail": {
          "Url":    "http://www.example.com/image/481989943",
          "Height": 125,
          "Width":  100
      },
      "Animated" : false,
      "IDs": [116, 943, 234, 38793]
    }
}

when read using the Preserves text syntax encodes via the binary syntax as follows:

B7
  B1 05 "Image"
  B7
    B1 03 "IDs"      B5
                       A0 74
                       A1 03 AF
                       A1 00 EA
                       A2 00 97 89
                     84
    B1 05 "Title"    B1 14 "View from 15th Floor"
    B1 05 "Width"    A1 03 20
    B1 06 "Height"   A1 02 58
    B1 08 "Animated" B3 05 "false"
    B1 09 "Thumbnail"
      B7
        B1 03 "Url"    B1 26 "http://www.example.com/image/481989943"
        B1 05 "Width"  A0 64
        B1 06 "Height" A0 7D
      84
  84
84

The second RFC 8259 example:

[
  {
     "precision": "zip",
     "Latitude":  37.7668,
     "Longitude": -122.3959,
     "Address":   "",
     "City":      "SAN FRANCISCO",
     "State":     "CA",
     "Zip":       "94107",
     "Country":   "US"
  },
  {
     "precision": "zip",
     "Latitude":  37.371991,
     "Longitude": -122.026020,
     "Address":   "",
     "City":      "SUNNYVALE",
     "State":     "CA",
     "Zip":       "94085",
     "Country":   "US"
  }
]

encodes to binary as follows:

B5
  B7
    B1 03 "Zip"        B1 05 "94107"
    B1 04 "City"       B1 0D "SAN FRANCISCO"
    B1 05 "State"      B1 02 "CA"
    B1 07 "Address"    B1 00
    B1 07 "Country"    B1 02 "US"
    B1 08 "Latitude"   83 40 42 E2 26 80 9D 49 52
    B1 09 "Longitude"  83 C0 5E 99 56 6C F4 1F 21
    B1 09 "precision"  B1 03 "zip"
  84
  B7
    B1 03 "Zip"        B1 05 "94085"
    B1 04 "City"       B1 09 "SUNNYVALE"
    B1 05 "State"      B1 02 "CA"
    B1 07 "Address"    B1 00
    B1 07 "Country"    B1 02 "US"
    B1 08 "Latitude"   83 40 42 AF 9D 66 AD B4 03
    B1 09 "Longitude"  83 C0 5E 81 AA 4F CA 42 AF
    B1 09 "precision"  B1 03 "zip"
  84
84

Notes

  1. All Unicode code-points are permitted, including NUL (code point zero). 

  2. Happily, the design of UTF-8 is such that this gives the same result as a lexicographic byte-by-byte comparison of the UTF-8 encoding of a string! 

  3. The Racket programming language defines “prefab” structure types, which map well to our Records. Racket supports record extensibility by encoding record supertypes into record labels as specially-formatted lists. 

  4. It is occasionally (but seldom) necessary to interpret such Symbol labels as UTF-8 encoded IRIs. Where a label can be read as a relative IRI, it is notionally interpreted with respect to the IRI urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34; where a label can be read as an absolute IRI, it stands for that IRI; and otherwise, it cannot be read as an IRI at all, and so the label simply stands for itself, that is, for its own Value.

  5. Rationale. Why include Embeddeds as a special class, distinct from, say, a specially-labeled Record? First, a Record can only hold other Values: in order to embed values such as live pointers to Java objects, some means of “escaping” from the Value data type must be provided. Second, Embeddeds are meant to be able to denote stateful entities, for which comparison by address is appropriate; however, we do not wish to place restrictions on the nature of these entities: if we had used Records instead of distinct Embeddeds, users would have to invent an encoding of domain data into Records that reflected domain ordering into Value ordering. This is often difficult and may not always be possible. Finally, because Embeddeds are intended to be able to represent network and memory locations, they must be able to be rewritten at network and process boundaries. Having a distinct class allows generic Embedded rewriting without the quotation-related complications of encoding references as, say, Records. 

  6. The label in this example happens to line up with Racket’s representation of a record label for an inheritance hierarchy where titled extends person extends thing:

    (struct date (year month day) #:prefab)
    (struct thing (id) #:prefab)
    (struct person thing (name date-of-birth) #:prefab)
    (struct titled person (title) #:prefab)
    

    For more detail on Racket’s representations of record labels, see the Racket documentation for make-prefab-struct.

  7. The following schema definitions match exactly the JSON subset of a Preserves input:

    version 1 .
    JSON = @string string / @integer int / @double double / @boolean JSONBoolean / @null =null
         / @array [JSON ...] / @object { string: JSON ...:... } .
    JSONBoolean = =true / =false .