Conventions for Common Data Types
The Value data type is essentially an S-Expression, able to represent semi-structured data over ByteString, String, SignedInteger atoms and so on.[1]
However, users need a wide variety of data types for representing domain-specific values such as various kinds of encoded and normalized text, calendrical values, machine words, and so on.
Appropriately-labelled Records denote these domain-specific data types.[2]
All of these conventions are optional. They form a layer atop the core Value structure. Non-domain-specific tools do not in general need to treat them specially.
Validity. Many of the labels we will describe in this section come with side-conditions on the contents of labelled Records. It is possible to construct an instance of Value that violates these side-conditions without ceasing to be a Value or becoming unrepresentable. However, we say that such a Value is invalid because it fails to honour the necessary side-conditions.
Implementations SHOULD allow two modes of working: one which treats all Values identically, without regard for side-conditions, and one which enforces validity (i.e. side-conditions) when reading, writing, or constructing Values.
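The two modes of working can be illustrated with a small sketch. Everything here is hypothetical and not part of any Preserves implementation: the `Record` stand-in (a label plus a tuple of fields), the `SIDE_CONDITIONS` registry, and the `enforce` flag are names invented for illustration. The only side-condition registered is the one for the `u8` label described later in this section.

```python
from typing import Any, Callable, Dict, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical stand-in for a Preserves Record: a label plus fields."""
    label: str
    fields: Tuple[Any, ...]


def _u8_ok(fields: Tuple[Any, ...]) -> bool:
    # Side-condition for <u8 x>: exactly one integer field with 0 <= x <= 255.
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return (
        len(fields) == 1
        and isinstance(fields[0], int)
        and not isinstance(fields[0], bool)
        and 0 <= fields[0] <= 255
    )


# Hypothetical registry mapping labels to side-condition checks.
SIDE_CONDITIONS: Dict[str, Callable[[Tuple[Any, ...]], bool]] = {"u8": _u8_ok}


def make_record(label: str, *fields: Any, enforce: bool = False) -> Record:
    """Construct a Record; check side-conditions only when enforce is True."""
    record = Record(label, tuple(fields))
    if enforce:
        check = SIDE_CONDITIONS.get(label)
        if check is not None and not check(record.fields):
            raise ValueError(f"invalid <{label}> record: side-condition violated")
    return record


# Lax mode: the record is still a perfectly good Value, merely an invalid one.
lax = make_record("u8", 300)
```

In the lax mode the out-of-range record is constructed without complaint; calling `make_record("u8", 300, enforce=True)` raises `ValueError` instead. Labels with no registered side-condition pass through unchanged in both modes.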
Metaconventions.
By and large, Capitalized and CamelCase identifiers refer to types or schema definition names describing families of Values, while kebab-case, lisp-style identifiers are used for concrete symbols appearing in e.g. Record labels.
IOLists.
Inspired by Erlang’s notions of iolist() and iodata(), an IOList is any tree constructed from ByteStrings and Sequences. Formally, an IOList is either a ByteString or a Sequence of IOLists.
IOLists can be useful for vectored I/O. Additionally, the flexibility of IOList trees allows annotation of interior portions of a tree.
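The recursive definition can be made concrete with a short sketch. It assumes, purely for illustration, that a ByteString is modelled as Python `bytes` and a Sequence as a Python `list`; the helper names `is_iolist` and `flatten_iolist` are hypothetical. Flattening to a single byte string is one plausible way to consume an IOList, in the spirit of Erlang's `iolist_to_binary`, and is not something the text above mandates.

```python
from typing import List, Union

# Assumption: ByteString ~ bytes, Sequence ~ list.
IOList = Union[bytes, List["IOList"]]


def is_iolist(value: object) -> bool:
    """True if value is a ByteString or a Sequence of IOLists."""
    if isinstance(value, bytes):
        return True
    if isinstance(value, list):
        return all(is_iolist(item) for item in value)
    return False


def flatten_iolist(value: object) -> bytes:
    """Concatenate the ByteStrings of an IOList in left-to-right order."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, list):
        return b"".join(flatten_iolist(item) for item in value)
    raise TypeError("not an IOList")


nested = [b"\x00\x01", [b"\x02", [b"\x03"]], b"\x04"]
flat = flatten_iolist(nested)
```

The empty Sequence is an IOList under this definition, and it flattens to the empty byte string.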
Comments.
String values used as annotations are conventionally interpreted as comments. Special syntax exists for such string annotations, though the usual @-prefixed annotation notation can also be used.
# I am a comment for the Dictionary
{
  # I am a comment for the key
  key: # I am a comment for the value
       value
}

# I am a comment for this entire IOList, as are the next three lines.
#
# The previous line (containing only hash-newline) adds an empty
# string to the annotations attached to the entire IOList.
[
  #x"00010203"
  # I am a comment for the middle half of the IOList
  # A second comment for the same portion of the IOList
  @ # I am the first and only comment for the following comment
  "A third (itself commented!) comment for the same part of the IOList"
  [
    # I am a comment for the following ByteString
    #x"04050607"
    #x"08090A0B"
  ]
  #x"0C0D0E0F"
]
Interpreter specification lines (“shebang” lines).
Unix systems interpret #! at the beginning of an executable file specially. The text following #! on the first line is interpreted as a specification for an interpreter for the executable file. Preserves offers special support for #!, reading it similarly to a comment, but producing an <interpreter ...> annotation instead of a string.
For example,
#!/usr/bin/preserves-tool convert
[1, 2, 3]
is read as
@<interpreter "/usr/bin/preserves-tool convert"> [1, 2, 3]
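A rough sketch of the first step a reader might take is shown below. The function name `split_shebang` is hypothetical, and the handling of whitespace and line terminators is an assumption; the text above only establishes that the remainder of the first line becomes the single field of the interpreter annotation.

```python
from typing import Optional, Tuple


def split_shebang(text: str) -> Tuple[Optional[str], str]:
    """Split a leading '#!' line from the rest of a document.

    Returns (interpreter_spec, rest). interpreter_spec is None when the
    document does not begin with '#!'. Assumption: only the line terminator
    is removed from the specification; other whitespace is preserved.
    """
    if not text.startswith("#!"):
        return None, text
    first_line, _, rest = text.partition("\n")
    return first_line[2:].rstrip("\r"), rest


spec, body = split_shebang("#!/usr/bin/preserves-tool convert\n[1, 2, 3]\n")
```

A real reader would then attach an `interpreter` record carrying `spec` as an annotation on the value parsed from `body`.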
MIME-type tagged binary data.
Many internet protocols use media types (a.k.a. MIME types) to indicate the format of some associated binary data. For this purpose, we define MIMEData to be a record labelled mime with two fields, the first being a Symbol, the media type, and the second being a ByteString, the binary data.
While each media type may define its own rules for comparing documents, we define ordering among MIMEData representations of such media types following the general rules for ordering of Records.
Examples.
<mime application/octet-stream #"abcde">
<mime text/plain #"ABC">
<mime application/xml #"<xhtml/>">
<mime text/csv #"123,234,345">
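The shape of a MIMEData record can be checked mechanically. The sketch below uses hypothetical stand-ins: a `Symbol` class (a `str` subclass, so that symbols and strings stay distinguishable) and a `Record` named tuple; a ByteString is assumed to map to Python `bytes`. It checks structure only and does not verify that the symbol names a registered media type.

```python
from typing import Any, NamedTuple, Tuple


class Symbol(str):
    """Hypothetical stand-in for a Preserves Symbol, distinct from String."""


class Record(NamedTuple):
    """Hypothetical stand-in for a Preserves Record."""
    label: Any
    fields: Tuple[Any, ...]


def is_mime_data(value: Any) -> bool:
    """True if value is <mime media-type data> with a Symbol and a ByteString."""
    return (
        isinstance(value, Record)
        and isinstance(value.label, Symbol)
        and value.label == "mime"
        and len(value.fields) == 2
        and isinstance(value.fields[0], Symbol)
        and isinstance(value.fields[1], bytes)
    )


plain = Record(Symbol("mime"), (Symbol("text/plain"), b"ABC"))
```

A record whose media type is a String rather than a Symbol, or whose payload is a String rather than a ByteString, is rejected by this check.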
Unicode normalization forms.
Unicode defines multiple normalization forms for text. While no particular normalization form is required for Strings, users may need to unambiguously signal or require a particular normalization form. A NormalizedString is a Record labelled with unicode-normalization and having two fields, the first of which is a Symbol specifying the normalization form used (e.g. nfc, nfd, nfkc, nfkd), and the second of which is a String whose underlying Unicode scalar value sequence MUST be normalized according to the named normalization form.
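The side-condition on the second field can be checked with Python's standard `unicodedata` module, as in the sketch below. The function name is hypothetical, and treating any form other than the four listed above as invalid is an assumption, since the text gives those four only as examples.

```python
import unicodedata

# Map the symbols used in the record to the names unicodedata expects.
FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}


def is_normalized_as(form: str, text: str) -> bool:
    """True if text is already normalized according to the named form.

    Assumption: forms outside FORMS are reported as invalid.
    """
    python_form = FORMS.get(form)
    if python_form is None:
        return False
    # A string is in a normalization form exactly when normalizing it
    # to that form leaves it unchanged.
    return unicodedata.normalize(python_form, text) == text


composed = "\u00e9"      # 'é' as a single precomposed scalar value
decomposed = "e\u0301"   # 'e' followed by a combining acute accent
```

The two spellings of the same letter illustrate why the label matters: `composed` satisfies `nfc` but not `nfd`, and `decomposed` the reverse.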
IRIs (URIs, URLs, URNs, etc.).
An IRI is a Record labelled with iri and having one field, a String which is the IRI itself and which MUST be a valid absolute or relative IRI.
Machine words.
The definition of SignedInteger captures all integers. However, in certain circumstances it can be valuable to assert that a number inhabits a particular range, such as a fixed-width machine word.
A family of labels in and un (the letter i or u followed by the bit width n), for n ∈ {8,16,32,64,128}, denote n-bit-wide signed and unsigned range restrictions, respectively. Records with these labels MUST have one field, a SignedInteger, which MUST fall within the appropriate range. That is, to be valid,
- in <i8 x>, -128 <= x <= 127.
- in <u8 x>, 0 <= x <= 255.
- in <i16 x>, -32768 <= x <= 32767.
- etc.
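The ranges for the whole family follow from the bit width, which the sketch below computes. The `Record` stand-in and the helper names `word_range` and `is_valid_word` are hypothetical; labels are modelled as plain Python strings for brevity.

```python
import re
from typing import Any, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    """Hypothetical stand-in for a Preserves Record."""
    label: str
    fields: Tuple[Any, ...]


_WORD_LABEL = re.compile(r"([iu])(8|16|32|64|128)")


def word_range(label: str) -> Optional[Tuple[int, int]]:
    """Return the inclusive (low, high) range for a machine-word label.

    Returns None for any label outside the i/u family.
    """
    match = _WORD_LABEL.fullmatch(label)
    if match is None:
        return None
    bits = int(match.group(2))
    if match.group(1) == "i":
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def is_valid_word(record: Record) -> bool:
    """True if record is a machine-word record whose field is in range."""
    bounds = word_range(record.label)
    if bounds is None or len(record.fields) != 1:
        return False
    x = record.fields[0]
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(x, int) or isinstance(x, bool):
        return False
    low, high = bounds
    return low <= x <= high
```

For example, `is_valid_word(Record("i8", (127,)))` holds while `is_valid_word(Record("i8", (128,)))` does not, matching the first item of the list above.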
Anonymous Tuples and Unit.
A Tuple is a Record with label tuple and zero or more fields, denoting an anonymous tuple of values.
The 0-ary tuple, <tuple>, denotes the empty tuple, sometimes called “unit” or “void” (but not e.g. JavaScript’s “undefined” value).
Null and Undefined.
Tony Hoare’s “billion-dollar mistake” can be represented with the 0-ary Record <null>. An “undefined” value can be represented as <undefined>.
Dates and Times.
Dates, times, moments, and timestamps can be represented with a Record with label rfc3339 having a single field, a String, which MUST conform to one of the full-date, partial-time, full-time, or date-time productions of section 5.6 of RFC 3339. (In date-time, “T” and “Z” MUST be upper-case and “T” MUST be used; a space separating the full-date and full-time MUST NOT be used.)
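A syntactic check against the four productions can be sketched with regular expressions. The patterns below transcribe the ABNF shapes from section 5.6 of RFC 3339 and apply the upper-case “T” and “Z” restriction stated above; the function name `matching_production` is hypothetical. The sketch checks syntax only: it does not reject out-of-range values such as month 13 or hour 25.

```python
import re
from typing import Optional

_DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
_PARTIAL_TIME = r"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:\.[0-9]+)?"
_OFFSET = r"(?:Z|[+-][0-9]{2}:[0-9]{2})"
_FULL_TIME = _PARTIAL_TIME + _OFFSET

PRODUCTIONS = {
    "full-date": re.compile(_DATE),
    "partial-time": re.compile(_PARTIAL_TIME),
    "full-time": re.compile(_FULL_TIME),
    "date-time": re.compile(_DATE + "T" + _FULL_TIME),
}


def matching_production(text: str) -> Optional[str]:
    """Return the name of the production that text matches, else None.

    Syntax only: field ranges (months, days, hours, ...) are not checked.
    """
    for name, pattern in PRODUCTIONS.items():
        if pattern.fullmatch(text):
            return name
    return None
```

With these patterns, `"2024-02-29T12:34:56+01:00"` matches date-time, while the same moment written with a space separator or a lower-case “t” matches nothing and would make the record invalid.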
XML Infoset
XML Infoset describes the semantics of XML - that is, the underlying information contained in a document, independent of surface syntax.
A useful subset of XML Infoset, namely its Element Information Items (omitting processing instructions, entities, entity references, comments, namespaces, name prefixes, and base URIs), can be captured with the schema
Node = Text / Element .
Text = string .
Element =
  / @withAttributes
    <<rec> @localName symbol [@attributes Attributes @children Node ...]>
  / @withoutAttributes
    <<rec> @localName symbol @children [Node ...]> .
Attributes = { symbol: string ...:... } .
Examples.
<html
  <h1 {class: "title"} "Hello World!">
  <p
    "I could swear I've seen markup like this somewhere before. "
    "Perhaps it was "
    <a {href: "https://docs.racket-lang.org/search/index.html?q=xexpr%3F"} "here">
    "?"
  >
  <table
    <tr <th> <th "Column 1"> <th "Column 2">>
    <tr <th "Row 1"> <td 123> <td 234>>>
>
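As a rough illustration of the mapping, the sketch below converts an element parsed by Python's standard `xml.etree.ElementTree` into a nested structure shaped like the schema: the local name as the record label, an optional attributes dictionary as the first field, and then the child nodes. The `Record` stand-in and the helper names are hypothetical, labels and attribute keys are plain Python strings rather than symbols, and omitting the dictionary when an element has no attributes is an assumption about how the two Element variants are chosen.

```python
import xml.etree.ElementTree as ET
from typing import Any, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical stand-in for a Preserves Record."""
    label: str
    fields: Tuple[Any, ...]


def local_name(tag: str) -> str:
    """Drop ElementTree's '{namespace-uri}' prefix, keeping the local name."""
    return tag.rsplit("}", 1)[-1]


def to_node(element: ET.Element) -> Record:
    """Convert an element to a Record of (optional attributes, children...)."""
    fields = []
    if element.attrib:
        fields.append({local_name(k): v for k, v in element.attrib.items()})
    if element.text:
        fields.append(element.text)       # text before the first child
    for child in element:
        fields.append(to_node(child))
        if child.tail:
            fields.append(child.tail)     # text following this child
    return Record(local_name(element.tag), tuple(fields))


doc = ET.fromstring('<p class="intro">Hello <b>World</b>!</p>')
node = to_node(doc)
```

Here `node` corresponds to the record `<p {class: "intro"} "Hello " <b "World"> "!">`. The default ElementTree parser already discards comments and processing instructions, which matches the subset described above.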
Notes
1. Rivest’s S-Expressions are in many ways similar to Preserves. However, while they include binary data and sequences, and an obvious equivalence for them exists, they lack numbers per se as well as any kind of unordered structure such as sets or maps. In addition, while “display hints” allow labelling of binary data with an intended interpretation, they cannot be attached to any other kind of structure, and the “hint” itself can only be a binary blob.
2. Given Record’s existence, it may seem odd that Dictionary, Set, Double, etc. are given special treatment. Preserves aims to offer a useful basic equivalence predicate to programmers, and so if a data type demands a special equivalence predicate, as Dictionary, Set and Double all do, then the type should be included in the base language. Otherwise, it can be represented as a Record and treated separately. Boolean, String and Symbol are seeming exceptions. The first two merit inclusion because of their cultural importance, while Symbols are included to allow their use as Record labels. Primitive Symbol support avoids a bootstrapping issue.