Preserves: Text Syntax
Tony Garnock-Jones tonyg@leastfixedpoint.com
September 2024. Version 0.996.0.
Preserves is a data model, with associated serialization formats. This document defines one of those formats: a textual syntax for Values from the Preserves data model that is easy for people to read and write. An equivalent machine-oriented binary syntax also exists.
Preliminaries
The definition uses case-sensitive ABNF.
ABNF allows easy definition of US-ASCII-based languages. However, Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as a grammar for recognising sequences of Unicode scalar values.
Encoding. Textual syntax for a Value SHOULD be encoded using UTF-8 where possible.
Whitespace. Whitespace ws is defined as any number of spaces, tabs, carriage returns, or line feeds.
ws = *(%x20 / %x09 / CR / LF)
Commas. In some positions inside compound terms, commas are permitted and ignored.
commas = *(ws ",") ws
Delimiters. Some tokens (Boolean, SymbolOrNumber) MUST be followed by a delimiter or by the end of the input.[1]
delimiter = ws
          / "<" / ">" / "[" / "]" / "{" / "}"
          / "#" / ":" / DQUOTE / "'" / "@" / ";" / ","
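The delimiter rule amounts to one character of lookahead after a token. The following is a minimal sketch, not a reference implementation: the names DELIMITERS, is_delimited and read_boolean are hypothetical, and it assumes that the ws alternative of delimiter is meant as at least one whitespace character (an empty match would make the constraint vacuous).

```python
# Characters that may follow a Boolean or SymbolOrNumber token, per the
# `delimiter` rule: whitespace plus a fixed set of punctuation.
# Assumption: `ws` is read here as "one whitespace character".
DELIMITERS = set(' \t\r\n<>[]{}#:"\'@;,')


def is_delimited(text: str, end: int) -> bool:
    """Return True if the token ending just before index `end` is followed
    by a delimiter or by the end of the input."""
    return end >= len(text) or text[end] in DELIMITERS


def read_boolean(text: str, pos: int):
    """Hypothetical helper: read `#t` or `#f` at `pos`, enforcing the
    delimiter rule. Returns (value, next_position)."""
    token = text[pos:pos + 2]
    if token not in ("#t", "#f"):
        raise ValueError(f"expected a Boolean at position {pos}")
    if not is_delimited(text, pos + 2):
        raise ValueError(f"Boolean at position {pos} lacks a following delimiter")
    return token == "#t", pos + 2
```

Under these assumptions, `#t` followed by `]` or by the end of the input is accepted, while `#true` is rejected instead of being read as `#t` followed by `rue`.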
Grammar
Standalone documents may have trailing whitespace.
Document = Value ws
Any Value may be preceded by whitespace.
Value = ws (Record / Collection / Atom / Embedded)
Collection = Sequence / Set / Dictionary
Atom = Boolean / String / ByteString /
       QuotedSymbol / SymbolOrNumber
Each Record is an angle-bracket enclosed grouping of its label-Value followed by its field-Values.
Record = "<" Value *Value ws ">"
Sequences are enclosed in square brackets. Sets are written as values enclosed by the tokens #{ and }. Dictionary values are curly-brace-enclosed colon-separated pairs of values.[2] It is an error for a set to contain duplicate elements or for a dictionary to contain duplicate keys. When printing sets and dictionaries, implementations SHOULD order elements or keys, respectively, according to the total order over Values.[3]
Sequence = "[" *(commas Value) commas "]"
Set = "#{" *(commas Value) commas "}"
Dictionary = "{" *(commas Value ws ":" Value) commas "}"
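The duplicate check can be done while a reader assembles a collection from its already-parsed members. The sketch below is illustrative only: build_dictionary and build_set are hypothetical names, and Python's own equality and hashing stand in for Preserves equivalence, which is an assumption (Python treats 1 and 1.0 as equal, for example, and the data model need not).

```python
def build_dictionary(pairs):
    """Build a dict from already-parsed (key, value) pairs, rejecting
    duplicate keys. Assumption: keys are hashable Python values whose
    equality approximates Preserves equivalence."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate dictionary key: {key!r}")
        result[key] = value
    return result


def build_set(elements):
    """Build a frozenset from already-parsed elements, rejecting duplicates."""
    result = set()
    for element in elements:
        if element in result:
            raise ValueError(f"duplicate set element: {element!r}")
        result.add(element)
    return frozenset(result)
```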
Booleans are the simple literal strings #t and #f for true and false, respectively.
Boolean = %s"#t" / %s"#f"
Strings are, as in JSON, possibly escaped text surrounded by double quotes. The escaping rules are the same as for JSON,[4][5] except that unpaired surrogate code points MUST NOT be generated or accepted.[6]
String = DQUOTE *char DQUOTE
char = <any unicode scalar value except "\" or DQUOTE> / escaped / "\" DQUOTE
escaped = "\\" / "\/" / %s"\b" / %s"\f" / %s"\n" / %s"\r" / %s"\t"
        / %s"\u" 4HEXDIG
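The escaping rules, including the ban on unpaired surrogates, can be sketched as a small decoder for the text between the quotes. This is an illustration under stated assumptions rather than a definitive implementation: decode_string_body and _hex4 are hypothetical names, the caller is assumed to have already located the closing quote, and a high surrogate escape is combined with an immediately following low surrogate escape as JSON does.

```python
import string

_SIMPLE = {'"': '"', "\\": "\\", "/": "/", "b": "\b", "f": "\f",
           "n": "\n", "r": "\r", "t": "\t"}


def _hex4(body: str, start: int) -> int:
    """Read exactly four hexadecimal digits starting at `start`."""
    digits = body[start:start + 4]
    if len(digits) != 4 or any(c not in string.hexdigits for c in digits):
        raise ValueError("expected four hex digits after \\u")
    return int(digits, 16)


def decode_string_body(body: str) -> str:
    """Decode the text between the double quotes of a String."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == '"':
            raise ValueError("unescaped double quote inside string")
        if ch != "\\":
            if 0xD800 <= ord(ch) <= 0xDFFF:
                raise ValueError("raw surrogate code point in input")
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        esc = body[i + 1]
        if esc in _SIMPLE:
            out.append(_SIMPLE[esc])
            i += 2
            continue
        if esc != "u":
            raise ValueError(f"unknown escape \\{esc}")
        unit = _hex4(body, i + 2)
        i += 6
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: a low surrogate escape must follow at once.
            if body[i:i + 2] != "\\u":
                raise ValueError("unpaired high surrogate")
            low = _hex4(body, i + 2)
            if not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            i += 6
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        out.append(chr(unit))
    return "".join(out)
```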
A ByteString may be written in any of three different forms.[7]
The first is similar to a String, but prepended with a hash sign #. Many bytes map directly to printable 7-bit ASCII; the remainder must be escaped, either as \x followed by a two-digit hexadecimal number, or following the usual rules for double quote and backslash.
ByteString = "#" DQUOTE *binchar DQUOTE
binchar = <any unicode scalar value ≥32 and ≤126 except "\" or DQUOTE>
        / "\" ("\" / "/" / %s"b" / %s"f" / %s"n" / %s"r" / %s"t")
        / %s"\x" 2HEXDIG
        / "\" DQUOTE
The second is pairs of hexadecimal digits interleaved with whitespace and surrounded by #x" and ".
ByteString =/ %s"#x" DQUOTE *(ws 2HEXDIG) ws DQUOTE
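As a rough sketch of reading the hexadecimal form, the regular expression below mirrors the rule directly: whitespace may appear between digit pairs but not inside one. The function name decode_hex_bytestring is hypothetical, and accepting both upper- and lower-case hex digits is an assumption about how HEXDIG is meant to be read.

```python
import re

# *(ws 2HEXDIG) ws, wrapped in #x" ... "
_HEX_BYTES = re.compile(r'#x"((?:[ \t\r\n]*[0-9A-Fa-f]{2})*)[ \t\r\n]*"')


def decode_hex_bytestring(token: str) -> bytes:
    """Decode a complete #x"..." token into bytes."""
    match = _HEX_BYTES.fullmatch(token)
    if match is None:
        raise ValueError("not a well-formed #x\"...\" ByteString")
    # The regex has already checked that pairs are intact, so the
    # remaining whitespace can simply be removed.
    digits = re.sub(r"[ \t\r\n]", "", match.group(1))
    return bytes.fromhex(digits)
```

For instance, `#x"48 65 6c6c 6f"` decodes to the five bytes of `Hello`, while `#x"4 8"` is rejected because whitespace splits a digit pair.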
The third is a sequence of Base64 characters, interleaved with whitespace and surrounded by #[ and ]. Plain (+,/) and URL-safe (-,_) Base64 characters are accepted; URL-safe (-,_) characters SHOULD be generated by default. Padding characters (=) may be omitted.
ByteString =/ "#[" *(ws base64char) ws "]"
base64char = ALPHA / DIGIT / "+" / "/" / "-" / "_" / "="
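The Base64 form can be handled by normalising the input before handing it to a standard decoder. The following sketch assumes hypothetical function names (decode_base64_bytestring, encode_base64_bytestring); stripping padding on output is a choice made here for illustration, since the text above only says padding may be omitted.

```python
import base64
import re

# *(ws base64char) ws, wrapped in #[ ... ]
_B64 = re.compile(r'#\[((?:[ \t\r\n]*[A-Za-z0-9+/\-_=])*)[ \t\r\n]*\]')


def decode_base64_bytestring(token: str) -> bytes:
    """Decode a complete #[...] token, accepting both alphabets and
    optional padding."""
    match = _B64.fullmatch(token)
    if match is None:
        raise ValueError("not a well-formed #[...] ByteString")
    chars = re.sub(r"[ \t\r\n]", "", match.group(1)).rstrip("=")
    # Map the URL-safe alphabet onto the plain one, then restore padding.
    chars = chars.replace("-", "+").replace("_", "/")
    if len(chars) % 4 == 1:
        raise ValueError("truncated Base64 data")
    chars += "=" * (-len(chars) % 4)
    return base64.b64decode(chars, validate=True)


def encode_base64_bytestring(data: bytes) -> str:
    """Print bytes using the URL-safe alphabet, without padding."""
    body = base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")
    return "#[" + body + "]"
```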
A Symbol may be written in either of two forms.
The first is a quoted form, much the same as the syntax for Strings, including embedded escape syntax, except using a single-quote character (') instead of a double quote mark.
QuotedSymbol = "'" *symchar "'"
symchar = <any unicode scalar value except "\" or "'"> / escaped / "\'"
Alternatively, a Symbol may be written in a “bare” form.[8] The grammar for numeric data is a subset of the grammar for bare Symbols, so if a SymbolOrNumber also matches the grammar for Double or SignedInteger then it must be interpreted as one of those, and otherwise it must be interpreted as a bare Symbol.
SymbolOrNumber = 1*(ALPHA / DIGIT / sympunct / symuchar)
sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
           "?" / "_" / "=" / "+" / "-" / "/" / "." / "|"
symuchar = <any scalar value ≥128 whose Unicode category is
            Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd,
            Nl, No, Pc, Pd, Po, Sc, Sm, Sk, So, or Co>
Numeric data follow the JSON grammar except that leading zeros are permitted and an optional leading + sign is allowed. Doubles always have either a fractional part or an exponent part, whereas SignedIntegers never have either.[9][10]
Double = flt
SignedInteger = int
nat = 1*DIGIT
int = ["-"/"+"] nat
frac = "." 1*DIGIT
exp = %i"e" ["-"/"+"] 1*DIGIT
flt = int (frac exp / frac / exp)
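Once a SymbolOrNumber token has been read, deciding what it denotes is a matter of trying the numeric grammars first. The sketch below is one possible approach, with a hypothetical interpret_token helper and a tagged-tuple result chosen only for illustration. The patterns follow the int and flt rules above; [0-9] is used instead of \d because Python's \d also matches non-ASCII digits.

```python
import re

SIGNED_INTEGER = re.compile(r"[-+]?[0-9]+")
DOUBLE = re.compile(
    r"[-+]?[0-9]+(?:\.[0-9]+(?:[eE][-+]?[0-9]+)?|[eE][-+]?[0-9]+)")


def interpret_token(token: str):
    """Interpret an already-tokenized SymbolOrNumber.

    Returns a (kind, value) pair; this result shape is an assumption
    made for the example."""
    if SIGNED_INTEGER.fullmatch(token):
        # Python integers are arbitrary-precision, so nothing is truncated.
        return ("integer", int(token))
    if DOUBLE.fullmatch(token):
        return ("double", float(token))
    return ("symbol", token)
```

Tokens such as `1.` or `.5` match neither numeric rule and so come out as bare symbols under this reading.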
Some valid IEEE 754 Doubles are not covered by the grammar above, namely, the several million NaNs and the two infinities. These are represented as raw hexadecimal strings similar to hexadecimal ByteStrings. Implementations are free to use hexadecimal floating-point syntax wherever convenient, even for values representable using the grammar above.[11]
Double =/ "#xd" DQUOTE 8(ws 2HEXDIG) ws DQUOTE
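A sketch of reading and writing the hexadecimal Double form follows. The helper names are hypothetical, and the byte order is an assumption: the eight bytes are taken to be the IEEE 754 binary64 representation in big-endian order, which this section does not state explicitly.

```python
import re
import struct

# 8(ws 2HEXDIG) ws, wrapped in #xd" ... "
_HEX_DOUBLE = re.compile(r'#xd"((?:[ \t\r\n]*[0-9A-Fa-f]{2}){8})[ \t\r\n]*"')


def decode_hex_double(token: str) -> float:
    """Decode a complete #xd"..." token (assumed big-endian binary64)."""
    match = _HEX_DOUBLE.fullmatch(token)
    if match is None:
        raise ValueError("not a well-formed #xd\"...\" Double")
    raw = bytes.fromhex(re.sub(r"[ \t\r\n]", "", match.group(1)))
    return struct.unpack(">d", raw)[0]


def encode_hex_double(value: float) -> str:
    """Print a float in the hexadecimal form (assumed big-endian binary64)."""
    return '#xd"' + struct.pack(">d", value).hex() + '"'
```

Under that assumption, `#xd"7ff0000000000000"` reads as positive infinity and `#xd"fff0000000000000"` as negative infinity.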
Finally, an Embedded is written as a Value chosen to represent the denoted object, prefixed with #:.
Embedded = "#:" Value
Annotations and Comments
When written down, a Value may have an associated sequence of annotations carrying “out-of-band” contextual metadata about the value. Each annotation is, in turn, a Value, and may itself have annotations. The ordering of annotations attached to a Value is significant.
Value =/ ws Annotation Value
Annotation = "@" Value
Each annotation is preceded by @; the underlying annotated value follows its annotations. Here we extend only the syntactic nonterminal named “Value” without altering the semantic class of Values.
Comments. Strings annotating a Value are conventionally interpreted as comments associated with that value. Comments are sufficiently common that special syntax exists for them.
Annotation =/ "#" [(%x20 / %x09) linecomment] (CR / LF)
linecomment = *<any unicode scalar value except CR or LF>
When written this way, everything between the hash-space or hash-tab and the end of the line is included in the string annotating the Value.
Comments that are just hash # followed immediately by newline yield an empty-string annotation.
Interpreter specification lines. Unix systems specially interpret #! at the beginning of a file. For convenient use of Preserves syntax in Unix scripts, we define an interpretation for #! lines.
Annotation =/ "#!" linecomment (CR / LF)
Such annotations are read as a record with label interpreter and with a single string field containing the text following the ! and preceding the end of the line.[12][13]
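The two line-oriented annotation forms can be sketched together. This is an illustration, not a full reader: the function names are hypothetical, a comment is represented as a plain Python string, and an interpreter record as the tuple ("interpreter", text), both of which are assumptions made for the example.

```python
def read_line_annotation(text: str, pos: int):
    """Try to read one `#` comment or `#!` line starting at `pos`.

    Returns (annotation, next_pos), or None if neither form matches."""
    if not text.startswith("#", pos):
        return None
    end = pos + 1
    while end < len(text) and text[end] not in "\r\n":
        end += 1
    if end >= len(text):
        return None  # both rules require a terminating CR or LF
    rest = text[pos + 1:end]
    if rest == "":
        return "", end + 1                       # bare `#`: empty comment
    if rest[0] in " \t":
        return rest[1:], end + 1                 # `# text`: comment string
    if rest[0] == "!":
        return ("interpreter", rest[1:]), end + 1
    return None                                  # some other `#` token


def read_leading_annotations(text: str):
    """Collect line annotations at the start of `text`.

    Returns (annotations, remaining_text)."""
    annotations = []
    pos = 0
    while True:
        while pos < len(text) and text[pos] in " \t\r\n":
            pos += 1
        result = read_line_annotation(text, pos)
        if result is None:
            return annotations, text[pos:]
        annotation, pos = result
        annotations.append(annotation)
```

Applied to a document consisting of the lines `#!/one`, `#!/two`, `# three`, `#!/four` and `five`, this yields two interpreter records, the comment string `three`, a third interpreter record, and the remaining text `five`.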
Equivalence. Annotations appear within syntax denoting a Value; however, the annotations are not part of the denoted value. They are only part of the syntax. Annotations do not play a part in equivalences and orderings of Values.
Reflective tools such as debuggers, user interfaces, and message routers and relays (tools which process Values generically) may use annotated inputs to tailor their operation, or may insert annotations in their outputs. By contrast, in ordinary programs, as a rule of thumb, the presence, absence or content of an annotation should not change the control flow or output of the program.
Annotations are data describing Values, and are not in the domain of any specific application of Values. That is, an annotation will almost never cause a non-reflective program to do anything observably different.
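One way an implementation might keep annotations available to reflective tools while excluding them from equivalence is a wrapper type whose comparison ignores them. The Annotated class below is a hypothetical sketch, not a description of any existing Preserves library.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotated:
    """Hypothetical wrapper pairing a value with its annotations.

    `compare=False` removes `annotations` from the generated __eq__ and
    __hash__, so two wrappers are equal exactly when their items are."""
    item: object
    annotations: tuple = field(default=(), compare=False)


plain = Annotated("five")
commented = Annotated("five", annotations=("three",))
assert plain == commented              # annotations do not affect equality
assert hash(plain) == hash(commented)  # nor hashing
assert commented.annotations == ("three",)  # but remain inspectable
```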
Security Considerations
Whitespace. The textual format allows arbitrary whitespace in many positions. Consider optional restrictions on the amount of consecutive whitespace that may appear.
Annotations. Similarly, in modes where a Value is being read while annotations are skipped, an endless sequence of annotations may give an illusion of progress.
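A reader might bound the amount of input it will consume without producing a Value. The following sketch shows the idea for whitespace only; the function name and the default limit of 1024 are hypothetical, and the same counting pattern could be applied to the number of annotations skipped before a Value.

```python
def skip_whitespace(text: str, pos: int, limit: int = 1024) -> int:
    """Advance past whitespace starting at `pos`, refusing to skip more
    than `limit` consecutive whitespace characters.

    The default limit is an arbitrary value chosen for illustration."""
    start = pos
    while pos < len(text) and text[pos] in " \t\r\n":
        pos += 1
        if pos - start > limit:
            raise ValueError(
                f"more than {limit} consecutive whitespace characters")
    return pos
```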
Acknowledgements
The text syntax for Booleans, Symbols, and ByteStrings is directly inspired by Racket’s lexical syntax.
Appendix. Regular expressions for bare symbols and numbers
When parsing, if a token matches both SymbolOrNumber and Number, it’s a number; use Double and SignedInteger to disambiguate. If it matches SymbolOrNumber but not Number, it’s a “bare” Symbol.
SymbolOrNumber: ^[-a-zA-Z0-9~!$%^&*?_=+/.|]+$
Number: ^([-+]?\d+)((\.\d+([eE][-+]?\d+)?)|([eE][-+]?\d+))?$
Double: ^([-+]?\d+)((\.\d+([eE][-+]?\d+)?)|([eE][-+]?\d+))$
SignedInteger: ^([-+]?\d+)$
When printing, if a symbol matches both SymbolOrNumber and Number or neither SymbolOrNumber nor Number, it must be quoted ('...'). If it matches SymbolOrNumber but not Number, it may be printed as a “bare” Symbol.
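The printing rule can be sketched with the two regular expressions above. The function print_symbol is a hypothetical name; because the SymbolOrNumber pattern given here covers only ASCII, this sketch conservatively quotes any symbol containing non-ASCII characters, which is permitted but not required.

```python
import re

SYMBOL_OR_NUMBER = re.compile(r"[-a-zA-Z0-9~!$%^&*?_=+/.|]+")
NUMBER = re.compile(
    r"[-+]?[0-9]+(?:\.[0-9]+(?:[eE][-+]?[0-9]+)?|[eE][-+]?[0-9]+)?")


def print_symbol(name: str) -> str:
    """Print a Symbol bare when that is unambiguous, quoted otherwise."""
    if SYMBOL_OR_NUMBER.fullmatch(name) and not NUMBER.fullmatch(name):
        return name
    # Quoted form: only backslash and single quote need escaping.
    escaped = name.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"
```

A symbol named `123` is therefore printed as `'123'` so that it does not read back as an integer, while `hello` and `+` are printed bare.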
Notes
1. The addition of this constraint means that implementations must now use some kind of lookahead to make sure a delimiter follows a Boolean; this should not be onerous, as something similar is required to read SymbolOrNumbers correctly.
2. Implementation note. When implementing printing of Values using the textual syntax, consider supporting (a) optional pretty-printing with indentation, (b) optional JSON-compatible print mode for that subset of Value that is compatible with JSON, and (c) optional submodes for no commas, commas separating, and commas terminating elements or key/value pairs within a collection.
3. Rationale. Consistently printing the elements of unordered collections in some arbitrary but stable order helps, for example, keep diffs small and somewhat meaningful when Preserves values are pretty-printed to text documents under source control.
4. The grammar for String has the same effect as the JSON grammar for string.
5. In particular, note JSON’s rules around the use of surrogate pairs for scalar values not in the Basic Multilingual Plane. We encourage implementations to avoid using \u escapes when producing output, and instead to rely on the UTF-8 encoding of the entire document to handle scalar values outside the ASCII range correctly.
6. Because Preserves forbids unpaired surrogates in its text syntax, any valid JSON text including an unpaired surrogate code point will not be parseable using the Preserves text syntax rules.
7. Rationale. While the machine-oriented syntax defines just one representation for binary data, the text syntax is intended primarily for humans to use, and so it defines many. Different usages of binary data will be more naturally expressed in text as hexadecimal, Base 64, or almost-ASCII. Accepting multiple syntax variations improves the ergonomics of the text syntax.
8. Compare with the SPKI S-expression definition of “token representation”, and with the R6RS definition of identifiers.
9. Implementation note. Your language’s standard library likely has a good routine for converting between decimal notation and IEEE 754 floating-point. However, if not, or if you are interested in the challenges of accurately reading and writing floating point numbers, see the excellent matched pair of 1990 papers by Clinger and Steele & White, and a recent follow-up by Jaffer:
   Clinger, William D. ‘How to Read Floating Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93557.
   Steele, Guy L., Jr., and Jon L. White. ‘How to Print Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93559.
   Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013. http://arxiv.org/abs/1310.8121.
10. Implementation note. Be aware when implementing reading and writing of SignedIntegers that the data model requires arbitrary-precision integers. Your implementation may (but, ideally, should not) truncate precision when reading or writing a SignedInteger; however, if it does so, it should (a) signal its client that truncation has occurred, and (b) make it clear to the client that comparing such truncated values for equality or ordering will not yield results that match the expected semantics of the data model.
11. Rationale. Previous versions of this specification included an escape to the machine-oriented binary syntax by prefixing a ByteString containing the binary representation of a Value with #=. The only true need for this feature was to represent otherwise-unrepresentable floating-point values. Instead, this specification allows such floating-point values to be written directly. Removing the #= syntax simplifies implementations (there is no longer any need to support the machine-oriented syntax) and avoids complications around treatment of annotations potentially contained within machine-encoded values.
12. We encourage implementations to print annotations of the form <interpreter string> using the special #! syntax, as well, so long as the string does not contain a carriage return or line feed character.
13. Annotations written with #! are otherwise ordinary. In particular, multiple such lines are nothing special. For example,
    #!/one
    #!/two
    # three
    #!/four
    five
    reads as
    @<interpreter "/one">
    @<interpreter "/two">
    @"three"
    @<interpreter "/four">
    five