Preserves: Binary Syntax
Tony Garnock-Jones tonyg@leastfixedpoint.com
June 2025. Version 0.996.3.
Preserves is a data model, with associated serialization formats. This
document defines one of those formats: a binary syntax for Values from
the Preserves data model that is easy for computer
software to read and write. An equivalent human-readable text
syntax also exists.
Machine-Oriented Binary Syntax
A Repr is a binary-syntax encoding, or representation, of a Value.
For a value v, we write «v» for the Repr of v.
Type and Length representation.
Each Repr starts with a tag byte, describing the kind of information
represented. Depending on the tag, a length indicator, further encoded
information, and/or an ending tag may follow.
tag                          (simple atomic data)
tag ++ length ++ binarydata  (doubles, integers, strings, symbols, and binary)
tag ++ repr ++ ... ++ endtag (compound data)
The unique end tag is byte value 0x84.
If present after a tag, the length of a following piece of binary data
is formatted as a base 128 varint.1 We
write varint(m) for the varint-encoding of m. Quoting the
Google Protocol Buffers definition,
Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two’s complement representation of the number in groups of 7 bits, least significant group first.
For example, varint(15) is [0x0F], and varint(1000000000) is [0x80, 0x94, 0xeb, 0xdc,
0x03].
It is an error for a varint-encoded m in a Repr to be anything
other than the unique shortest encoding for that m. That is, a
varint-encoding of m MUST NOT end in 0 unless m=0.
Records, Sequences, Sets and Dictionaries.
      «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
        «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
       «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
«{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
There is no ordering requirement on the E_i elements or
K_i/V_i pairs.2 They may appear in any
order. However, the E_i and K_i MUST be pairwise distinct. In
addition, implementations SHOULD default to writing set elements and
dictionary key/value pairs in order sorted lexicographically by their
Reprs3, and MAY offer the option of
serializing in some other implementation-defined order.
SignedIntegers, Strings, ByteStrings and Symbols.
«S» = [0xB0] ++ varint(|intbytes(S)|) ++ intbytes(S)  if S ∈ SignedInteger
      [0xB1] ++ varint(|utf8(S)|) ++ utf8(S)          if S ∈ String
      [0xB2] ++ varint(|S|) ++ S                      if S ∈ ByteString
      [0xB3] ++ varint(|utf8(S)|) ++ utf8(S)          if S ∈ Symbol
For String and Symbol, the data following the tag and length is a
UTF-8 encoding of the Value. For ByteString, it is the raw data
contained within the Value unmodified. For SignedInteger, it is
the big-endian two’s-complement binary representation of the number,
taking exactly as many whole bytes as needed to unambiguously identify
the value and its sign. intbytes(0) is special-cased to be the empty
byte sequence;4 otherwise, the
most-significant bit of the first byte is the sign bit. (See
SignedInteger examples in the appendix
below.)
Booleans.
«#f» = [0x80]
«#t» = [0x81]
Doubles.
«D» = [0x87, 0x08] ++ binary64(D)  if D ∈ Double
The function binary64(D) yields the big-endian 8-byte IEEE 754 binary
representation of D.
Embeddeds.
«#:V» = [0x86] ++ «V»
The Repr of an Embedded is the Repr of a Value chosen to
represent the denoted object, prefixed with [0x86].
Annotations.
«@W V» = [0x85] ++ «W» «V»
Each annotation W precedes the Value V it
annotates.5 Both W and V MAY
themselves be further annotated. Implementations SHOULD default to
omitting annotations from Reprs. See examples in the
appendix.
Security Considerations
Annotations. In modes where a Value is being read while
annotations are skipped, an endless series of annotations may give an
illusion of progress.
Canonical form for cryptographic hashing and signing. No canonical
textual encoding of a Value is specified. However, a canonical
form exists for binary encoded Values, and
implementations SHOULD produce canonical binary encodings by
default; however, an implementation MAY permit two serializations of
the same Value to yield different binary Reprs.
Appendix. Autodetection of textual or binary syntax
Every tag byte in a binary Preserves Document falls within the range
[0x80, 0xBF]. These bytes, interpreted as UTF-8, are continuation
bytes, and will never occur as the first byte of a UTF-8 encoding. This
means no binary-encoded document can be misinterpreted as valid UTF-8.
Conversely, a UTF-8 document must start with a valid scalar value,
meaning in particular that it must not start with a byte in the range
[0x80, 0xBF]. This means that no UTF-8 encoded textual-syntax
Preserves document can be misinterpreted as a binary-syntax document.
Examination of the top two bits of the first byte of a document gives
its syntax: if the top two bits are 10, it should be interpreted as
a binary-syntax document; otherwise, it should be interpreted as text.
Appendix. Table of tag values
 80 - False
 81 - True
 84 - End marker
 85 - Annotation
 86 - Embedded
 87 - Double
 B0 - SignedInteger
 B1 - String
 B2 - ByteString
 B3 - Symbol
 B4 - Record
 B5 - Sequence
 B6 - Set
 B7 - Dictionary
All tags fall in the range [0x80, 0xBF].
Tag values 82, 83, 88…AF, and B8…BF are reserved.
Appendix. Binary SignedInteger representation
Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary SignedInteger
values.
| Integer range | Bytes required | Encoding (hex) | 
|---|---|---|
| 0 | 2 | B0 00 | 
    
| -27 ≤ n < 27 (i8) | 3 | B0 01 XX | 
    
| -215 ≤ n < 215 (i16) | 4 | B0 02 XX XX | 
    
| -223 ≤ n < 223 (i24) | 5 | B0 03 XX XX XX | 
    
| -231 ≤ n < 231 (i32) | 6 | B0 04 XX XX XX XX | 
    
| -239 ≤ n < 239 (i40) | 7 | B0 05 XX XX XX XX XX | 
    
| -247 ≤ n < 247 (i48) | 8 | B0 06 XX XX XX XX XX XX | 
    
| -255 ≤ n < 255 (i56) | 9 | B0 07 XX XX XX XX XX XX XX | 
    
| -263 ≤ n < 263 (i64) | 10 | B0 08 XX XX XX XX XX XX XX XX | 
    
Appendix. Examples
SignedInteger examples
  «-257» = B0 02 FE FF     «-2» = B0 01 FE       «255» = B0 02 00 FF
  «-256» = B0 02 FF 00     «-1» = B0 01 FF       «256» = B0 02 01 00
  «-255» = B0 02 FF 01      «0» = B0 00        «32767» = B0 02 7F FF
  «-129» = B0 02 FF 7F      «1» = B0 01 01     «32768» = B0 03 00 80 00
  «-128» = B0 01 80       «127» = B0 01 7F     «65535» = B0 03 00 FF FF
  «-127» = B0 01 81       «128» = B0 02 00 80  «65536» = B0 03 01 00 00
  «87112285931760246646623899502532662132736»
    = B0 12 01 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00
            00 00
Annotation examples
The Repr corresponding to textual syntax @a@b[], i.e. an empty
sequence annotated with two symbols, a and b, is
«@a @b []» = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
           = [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
Annotations may themselves be annotated. Here, c is annotated with
b, which itself is annotated with a:
«@ @a b c» = [0x85] ++ [0x85] ++ «a» ++ «b» ++ «c»>
Notes
- 
      
Also known as LEB128 encoding, for unsigned integers. Varints and LEB128-encoded integers differ only for negative numbers, which cannot appear as length indicators and are thus not used in Preserves. ↩
 - 
      
In the BitTorrent encoding format, bencoding, dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of
Values is canonical. We encourage, but do not require that key/value pairs (or set elements) be in sorted order for serializedValues; however, a canonical form forReprs does exist where a sorted ordering is required. ↩ - 
      
It’s important to note that the sort ordering for writing out set elements and dictionary key/value pairs is not the same as the sort ordering implied by the semantic ordering of those elements or keys. For example, the
Reprof a negative number sorts lexicographically after theReprof zero, despite being semantically less than zero.Rationale. This is for ease-of-implementation reasons: not all languages can easily represent sorted sets or sorted dictionaries, but encoding and then sorting byte strings is much more likely to be within easy reach. ↩
 - 
      
Without the special case of
intbytes(0)yielding the empty byte sequence, no input tointbyteswould result in an empty sequence. In principle, either0or-1could be special-cased to the empty sequence; here, we arbitrarily choose0. ↩ - 
      
Rationale. The syntax for annotated values is unlike any other compound syntax defined here. It allows parsers to statelessly skip annotations: while
0x85is the next input byte, skip it, skip a singleRepr, and then try again; otherwise, parse the input stream as usual.By contrast, one might imagine using a parenthesized syntax like the syntax of compound data. In that case, parsers would have to remember information about nesting of annotations. This would not necessarily be an issue for recursive (“DOM”-like) parsers, but streaming (“SAX”-like) parsers would have to either take on complexity internally or force it on their clients, even when those clients did not care about annotations. ↩