Conventions for Common Data Types

The Value data type is essentially an S-Expression, able to represent semi-structured data over ByteString, String, SignedInteger atoms and so on.[1]

However, users need a wide variety of data types for representing domain-specific values such as various kinds of encoded and normalized text, calendrical values, machine words, and so on.

Appropriately-labelled Records denote these domain-specific data types.[2]

All of these conventions are optional. They form a layer atop the core Value structure. Non-domain-specific tools do not in general need to treat them specially.

Validity. Many of the labels we will describe in this section come with side-conditions on the contents of labelled Records. It is possible to construct an instance of Value that violates these side-conditions without ceasing to be a Value or becoming unrepresentable. However, we say that such a Value is invalid because it fails to honour the necessary side-conditions. Implementations SHOULD allow two modes of working: one which treats all Values identically, without regard for side-conditions, and one which enforces validity (i.e. side-conditions) when reading, writing, or constructing Values.

Metaconventions.

By and large, Capitalized and CamelCase identifiers refer to types or schema definition names describing families of Values, while kebab-case, lisp-style identifiers are used for concrete symbols appearing in, for example, Record labels.

IOLists.

Inspired by Erlang’s notions of iolist() and iodata(), an IOList is any tree constructed from ByteStrings and Sequences. Formally, an IOList is either a ByteString or a Sequence of IOLists.

IOLists can be useful for vectored I/O. Additionally, the flexibility of IOList trees allows annotation of interior portions of a tree.
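
The recursive definition above can be sketched in Python. This is an illustration only: modelling a ByteString as Python bytes and a Sequence as a Python list is an assumption made for this example, and the names is_iolist and flatten_iolist are hypothetical rather than part of any Preserves library.

```python
from typing import List, Union

# Assumption for this sketch: ByteString ~ bytes, Sequence ~ list.
IOList = Union[bytes, List["IOList"]]


def is_iolist(value: object) -> bool:
    """Return True if value is a ByteString or a Sequence of IOLists."""
    if isinstance(value, bytes):
        return True
    if isinstance(value, list):
        return all(is_iolist(item) for item in value)
    return False


def flatten_iolist(value: IOList) -> bytes:
    """Concatenate every ByteString in the tree, in left-to-right order."""
    if isinstance(value, bytes):
        return value
    return b"".join(flatten_iolist(item) for item in value)
```

Flattening discards the tree shape, and with it any place to attach annotations to interior portions, so a tool that cares about those annotations would keep the tree as it is.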

Comments.

String values used as annotations are conventionally interpreted as comments. Special syntax exists for such string annotations, though the usual @-prefixed annotation notation can also be used.

# I am a comment for the Dictionary
{
  # I am a comment for the key
  key: # I am a comment for the value
       value
}

# I am a comment for this entire IOList, as are the next three lines.
#
# The previous line (containing only hash-newline) adds an empty
# string to the annotations attached to the entire IOList.
[
  #x"00010203"
  # I am a comment for the middle half of the IOList
  # A second comment for the same portion of the IOList
  @ # I am the first and only comment for the following comment
    "A third (itself commented!) comment for the same part of the IOList"
  [
    # I am a comment for the following ByteString
    #x"04050607"
    #x"08090A0B"
  ]
  #x"0C0D0E0F"
]

Interpreter specification lines (“shebang” lines).

Unix systems interpret #! at the beginning of an executable file specially. The text following #! on the first line is interpreted as a specification for an interpreter for the executable file. Preserves offers special support for #!, reading it similarly to a comment, but producing an <interpreter ...> annotation instead of a string.

For example,

#!/usr/bin/preserves-tool convert
[1, 2, 3]

is read as

@<interpreter "/usr/bin/preserves-tool convert"> [1, 2, 3]
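
As a rough illustration of the reading rule, the Python sketch below separates a leading #! line from the rest of a document. The function name split_interpreter_line is hypothetical, and how an actual reader treats whitespace or line endings on that line is not specified here, so the sketch simply keeps the text exactly as written.

```python
from typing import Optional, Tuple


def split_interpreter_line(text: str) -> Tuple[Optional[str], str]:
    """Split off a leading '#!' line.

    Returns (interpreter_spec, remaining_text). interpreter_spec is None
    when the text does not begin with '#!'.
    """
    if not text.startswith("#!"):
        return None, text
    first_line, _, rest = text.partition("\n")
    # Drop the '#!' marker; keep everything after it verbatim (assumption).
    return first_line[2:], rest
```

A reader built this way would attach the returned specification as the single field of an interpreter Record annotating the Value parsed from the remaining text.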

MIME-type tagged binary data.

Many internet protocols use media types (a.k.a. MIME types) to indicate the format of some associated binary data. For this purpose, we define MIMEData to be a Record labelled mime with two fields, the first being a Symbol, the media type, and the second being a ByteString, the binary data.

While each media type may define its own rules for comparing documents, we define ordering among MIMEData representations of such media types following the general rules for ordering of Records.

Examples.

<mime application/octet-stream #"abcde">
<mime text/plain #"ABC">
<mime application/xml #"<xhtml/>">
<mime text/csv #"123,234,345">

Unicode normalization forms.

Unicode defines multiple normalization forms for text. While no particular normalization form is required for Strings, users may need to unambiguously signal or require a particular normalization form. A NormalizedString is a Record labelled with unicode-normalization and having two fields, the first of which is a Symbol specifying the normalization form used (e.g. nfc, nfd, nfkc, nfkd), and the second of which is a String whose underlying Unicode scalar value sequence MUST be normalized according to the named normalization form.
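
A validity check for this side-condition might look like the following Python sketch, which relies on the standard unicodedata module. The function name and the decision to reject unrecognised form names are assumptions made for the example; the text above lists the four forms only as examples.

```python
import unicodedata

# Map the lower-case symbols used as the first field to the names that
# Python's unicodedata module expects.
FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}


def is_valid_normalized_string(form: str, text: str) -> bool:
    """Check that text is already normalized according to the named form."""
    python_form = FORMS.get(form)
    if python_form is None:
        # Assumption: a form this sketch does not know is treated as invalid.
        return False
    # A string is in a given form exactly when normalizing leaves it unchanged.
    return unicodedata.normalize(python_form, text) == text
```

For instance, the single scalar value U+00E9 passes the check for nfc but not for nfd, while the two-scalar sequence U+0065 U+0301 passes for nfd but not for nfc.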

IRIs (URIs, URLs, URNs, etc.).

An IRI is a Record labelled with iri and having one field, a String which is the IRI itself and which MUST be a valid absolute or relative IRI.

Machine words.

The definition of SignedInteger captures all integers. However, in certain circumstances it can be valuable to assert that a number inhabits a particular range, such as a fixed-width machine word.

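A family of labels in and un for n ∈ {8, 16, 32, 64, 128} (that is, i8, u8, i16, u16, and so on) denotes n-bit-wide signed and unsigned range restrictions, respectively. Records with these labels MUST have one field, a SignedInteger, which MUST fall within the appropriate range. That is, to be valid, the field x of <in x> must satisfy -2^(n-1) ≤ x ≤ 2^(n-1) - 1, and the field x of <un x> must satisfy 0 ≤ x ≤ 2^n - 1.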
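
The range restrictions for these labels can be sketched in Python as follows. The names RANGES and is_valid_word are hypothetical, and representing a SignedInteger as a Python int is an assumption of the example; the bounds themselves are the usual two's-complement and unsigned ranges for an n-bit word.

```python
WIDTHS = (8, 16, 32, 64, 128)

# Inclusive (lowest, highest) bounds for each label, e.g. "i8" or "u64".
RANGES = {}
for n in WIDTHS:
    RANGES[f"i{n}"] = (-(2 ** (n - 1)), 2 ** (n - 1) - 1)
    RANGES[f"u{n}"] = (0, 2 ** n - 1)


def is_valid_word(label: str, value: object) -> bool:
    """Check the side-condition for a one-field machine-word Record."""
    if label not in RANGES:
        raise ValueError(f"not a machine-word label: {label!r}")
    # bool is a subclass of int in Python; exclude it explicitly.
    if not isinstance(value, int) or isinstance(value, bool):
        return False
    lowest, highest = RANGES[label]
    return lowest <= value <= highest
```

Because Python integers are unbounded, the same check works unchanged for the 128-bit labels.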

Anonymous Tuples and Unit.

A Tuple is a Record with label tuple and zero or more fields, denoting an anonymous tuple of values.

The 0-ary tuple, <tuple>, denotes the empty tuple, sometimes called “unit” or “void” (but not e.g. JavaScript’s “undefined” value).

Null and Undefined.

Tony Hoare’s “billion-dollar mistake” can be represented with the 0-ary Record <null>. An “undefined” value can be represented as <undefined>.

Dates and Times.

Dates, times, moments, and timestamps can be represented with a Record with label rfc3339 having a single field, a String, which MUST conform to one of the full-date, partial-time, full-time, or date-time productions of section 5.6 of RFC 3339. (In date-time, “T” and “Z” MUST be upper-case and “T” MUST be used; a space separating the full-date and full-time MUST NOT be used.)
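
A syntactic check for the four permitted productions could be sketched in Python as below. It is deliberately partial: it tests only the shape of the string, following the ABNF of RFC 3339 section 5.6, and does not verify field ranges such as a month between 01 and 12. The name is_rfc3339_shape is hypothetical.

```python
import re

FULL_DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
PARTIAL_TIME = r"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]+)?"
# Upper-case "Z" only, per the convention above.
TIME_OFFSET = r"(?:Z|[+-][0-9]{2}:[0-9]{2})"
FULL_TIME = PARTIAL_TIME + TIME_OFFSET
# Upper-case "T" is the only accepted separator; a space is rejected.
DATE_TIME = FULL_DATE + "T" + FULL_TIME

RFC3339_SHAPE = re.compile(
    "(?:" + "|".join([DATE_TIME, FULL_DATE, FULL_TIME, PARTIAL_TIME]) + ")"
)


def is_rfc3339_shape(text: str) -> bool:
    """True if text has the shape of full-date, partial-time, full-time or date-time."""
    return RFC3339_SHAPE.fullmatch(text) is not None
```

A stricter validator would additionally check the numeric ranges and calendar rules that RFC 3339 states alongside the grammar.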

XML Infoset.

XML Infoset describes the semantics of XML, that is, the underlying information contained in a document, independent of surface syntax.

A useful subset of XML Infoset, namely its Element Information Items (omitting processing instructions, entities, entity references, comments, namespaces, name prefixes, and base URIs), can be captured with the schema

Node = Text / Element .
Text = string .
Element =
  / @withAttributes
    <<rec> @localName symbol [@attributes Attributes @children Node ...]>
  / @withoutAttributes
    <<rec> @localName symbol                        @children [Node ...]> .
Attributes = { symbol: string ...:... } .

Examples.

<html
 <h1 {class: "title"} "Hello World!">
 <p
  "I could swear I've seen markup like this somewhere before. "
  "Perhaps it was "
  <a {href: "https://docs.racket-lang.org/search/index.html?q=xexpr%3F"} "here">
  "?"
 >
 <table
  <tr <th> <th "Column 1"> <th "Column 2">>
  <tr <th "Row 1"> <td 123> <td 234>>>
>
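
To show how a parsed XML document lines up with this schema, the Python sketch below converts an xml.etree.ElementTree element into nested tuples of the form (localName, attributes, children...) or (localName, children...). Using tuples to stand in for Records, a dict for Attributes, and the name element_to_node are all assumptions of the example; namespaces are not handled, in keeping with the subset described above.

```python
import xml.etree.ElementTree as ET


def element_to_node(elem: ET.Element) -> tuple:
    """Convert an Element into a tuple mirroring the Element schema."""
    children = []
    if elem.text:
        children.append(elem.text)  # Text before the first child element
    for child in elem:
        children.append(element_to_node(child))
        if child.tail:
            children.append(child.tail)  # Text following this child element
    attributes = dict(elem.attrib)
    if attributes:
        # withAttributes variant: attributes come first, then the children.
        return (elem.tag, attributes, *children)
    # withoutAttributes variant: children only.
    return (elem.tag, *children)
```

Applied to the document `<p>Hi <a href="x">here</a>?</p>`, the function yields `("p", "Hi ", ("a", {"href": "x"}, "here"), "?")`.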

Notes

  1. Rivest’s S-Expressions are in many ways similar to Preserves. However, while they include binary data and sequences, and an obvious equivalence for them exists, they lack numbers per se as well as any kind of unordered structure such as sets or maps. In addition, while “display hints” allow labelling of binary data with an intended interpretation, they cannot be attached to any other kind of structure, and the “hint” itself can only be a binary blob. 

  2. Given Record’s existence, it may seem odd that Dictionary, Set, Double, etc. are given special treatment. Preserves aims to offer a useful basic equivalence predicate to programmers, and so if a data type demands a special equivalence predicate, as Dictionary, Set and Double all do, then the type should be included in the base language. Otherwise, it can be represented as a Record and treated separately. Boolean, String and Symbol are seeming exceptions. The first two merit inclusion because of their cultural importance, while Symbols are included to allow their use as Record labels. Primitive Symbol support avoids a bootstrapping issue.