P-expressions

Tony Garnock-Jones tonyg@leastfixedpoint.com
May 2024. Version 0.3.2.

Experimental. This document defines a grammar called Preserves Expressions (P-expressions, pexprs) that includes ordinary Preserves text syntax but offers extensions sufficient to support a Lisp- or Haskell-like programming notation.

Motivation. The text syntax for Preserves works well for writing Values, i.e. data. However, in some contexts, Preserves applications need a broader grammar that allows interleaving of expressions with data. Two examples are the Preserves Schema language and the Synit configuration scripting language, both of which (ab)use Preserves text syntax as a kind of programming notation.

Preliminaries

The P-expression grammar includes by reference the definition of Atom from the text syntax, as well as the definitions that Atom depends on.

Whitespace. Whitespace ws is, as in the text syntax, defined as any number of spaces, tabs, carriage returns, or line feeds.

            ws = *(%x20 / %x09 / CR / LF)

No changes to the Preserves semantic model are made. Every Preserves text-syntax term can be parsed as a valid P-expression, but in general P-expressions must be rewritten or otherwise interpreted before a meaningful Preserves value can be arrived at (see below).

Grammar

Standalone documents containing P-expressions are sequences of individual Exprs, followed by annotations, comments, and/or whitespace.

      Document = *Expr Trailer ws

A single P-expression Expr can be an Atom from the text syntax, a compound expression, special punctuation, an Embedded expression, or an Annotated expression. The class SimpleExpr includes all of Expr except special punctuation.

          Expr = ws (SimpleExpr / Punct)
    SimpleExpr = Atom / Compound / Embedded / Annotated

Embedded and annotated values are as in the text syntax, differing only in that uses of Value are replaced with SimpleExpr.

       Embedded = "#:" SimpleExpr
      Annotated = Annotation SimpleExpr
     Annotation = "@" SimpleExpr / "#" [(%x20 / %x09/ %x21) linecomment] (CR / LF)
    linecomment = *<any unicode scalar value except CR or LF>

P-expression special punctuation marks are comma, semicolon, and sequences of one or more colons.¹

         Punct = "," / ";" / 1*":"

Compound expressions are sequences of Exprs with optional trailing Annotations, surrounded by various kinds of parentheses.

      Compound = Sequence / Record / Block / Group / Set
      Sequence =  "[" *Expr Trailer ws "]"
        Record =  "<" *Expr Trailer ws ">"
         Block =  "{" *Expr Trailer ws "}"
         Group =  "(" *Expr Trailer ws ")"
           Set = "#{" *Expr Trailer ws "}"

In an Annotated P-expression, annotations and comments attach to the term following them, just as in the ordinary text syntax. However, it is common in programming notations to allow comments at the end of a file or other sequential construct. The ordinary text syntax forbids comments in these positions, but P-expressions allow them.

       Trailer = *(ws Annotation)

Encoding P-expressions as Preserves

We write ⌜p⌝ for the encoding into Preserves of a P-expression p.

⌜·⌝ : Expr	⟶	Value
⌜`[`p …`]`⌝	=	`[`⌜p⌝ …`]`
⌜`<`p …`>`⌝	=	`<r` ⌜p⌝ …`>`
⌜`{`p …`}`⌝	=	`<b` ⌜p⌝ …`>`
⌜`(`p …`)`⌝	=	`<g` ⌜p⌝ …`>`
⌜`#{`p …`}`⌝	=	`<s` ⌜p⌝ …`>`
⌜`#:`p⌝	=	`#:`⌜p⌝
⌜`@`p q⌝	=	`@`⌜p⌝ ⌜q⌝
⌜p⌝	=	p	when p ∈ Atom
⌜`,`⌝	=	`<p ','>`
⌜`;`⌝	=	`<p ';'>`
⌜`:` …⌝	=	`<p ':` …`'>`
⌜t⌝	=	`@`⌜a⌝ … `<a>`	where `@`a … are the annotations in t and t ∈ Trailer

The record <a> acts as an anchor for the annotations in a Trailer.

We overload the ⌜·⌝ notation for encoding whole Documents into sequences of Preserves values.

⌜·⌝ : P-expression Document	⟶	Preserves Sequence
⌜p …⌝	=	`[`⌜p⌝ …`]`
⌜p … `@`a …⌝	=	`[`⌜p⌝ … `@`⌜a⌝ … `<a>]`
		where `@`a … are trailing annotations

Interpreting P-expressions as Preserves

The previous section discussed ways of representing P-expressions using Preserves. Here, we discuss interpreting P-expressions as Preserves so that (1) a Preserves datum (2) written using Preserves text syntax and then (3) read as a P-expression can be (4) interpreted from that P-expression to yield the original datum.

Every (..) or ; that appears is an error.
Every :, ::, :::, … is an error, except in context of Blocks as described below.
Every , that appears is discarded.
Every Trailer that appears is an error.²
Every Record with no values in it is an error.
Every Block must contain zero or more repeating triplets of SimpleExpr, :, SimpleExpr. Any Block not following this pattern is an error. Each Block following the pattern is translated to a Dictionary containing a key/value pair for each triplet. Any Block with duplicate keys (under interpretation) is an error.
Every Set containing any duplicate expressions (under interpretation) is an error.

Appendix: Examples

Examples are given as pairs of P-expressions and their Preserves text-syntax encodings.

Individual P-expression `Expr`s

 ⌜<date 1821 (lookup-month "February") 3>⌝
= <r date 1821 <g lookup-month "February"> 3>

 ⌜<>⌝
= <r>

 ⌜(begin (println! (+ 1 2)) (+ 3 4))⌝
= <g begin <g println! <g + 1 2>> <g + 3 4>>

 ⌜()⌝
= <g>

 ⌜[() () ()]⌝
= [<g>, <g>, <g>]

 ⌜{
      setUp();
      # Now enter the loop
      loop: {
          greet("World");
      }
      tearDown();
  }⌝
= <b
      setUp <g> <p ';'>
      # Now enter the loop
      loop <p ':'> <b
          greet <g "World"> <p ';'>
      >
      tearDown <g> <p ';'>
  >

 ⌜[1 + 2.0, print "Hello", predicate: #t, foo, #:remote, bar]⌝
= [1 + 2.0 <p ','> print "Hello" <p ','> predicate <p ':'> #t <p ','>
   foo <p ','> #:remote <p ','> bar]

 ⌜#{1 2 3}⌝
= <s 1 2 3>

 ⌜#{(read) (read) (read)}⌝
= <s <g read> <g read> <g read>>

 ⌜{
      optional name: string,
      address: Address,
  }⌝
= <b
      optional name <p ':'> string <p ','>
      address <p ':'> Address <p ','>
  >

Whole `Document`s

 ⌜#!/example/of/a/shebang
  {
      key: value
      # example of a comment at the end of a dictionary
  }
  # example of a comment at the end of the input file⌝
= @<r interpreter "/example/of/a/shebang">
  [ <b
        key <p ':'> value
        @"example of a comment at the end of a dictionary" <a>
    >
    @"example of a comment at the end of the input file"
    <a>
  ]

Appendix: Reading vs. Parsing

Lisp systems first read streams of bytes into S-expressions and then parse those S-expressions into more abstract structures denoting various kinds of program syntax. Separation of reading from parsing is what gives Lisp its syntactic flexibility.

Similarly, the Apple programming language Dylan included a reader-parser split, with the Dylan reader producing D-expressions that are somewhat similar to P-expressions.

Finally, the Racket dialects Honu and Something use a reader-parser-macro setup, where the reader produces Racket data, the parser produces “syntax” and is user-extensible, and Racket’s own modular macro system rewrites this “syntax” down to core forms to be compiled to machine code.

Similarly, when using P-expressions as the foundation for a language, a generic P-expression reader can then feed into special-purpose parsers. The reader captures the coarse syntactic structure of a program, and the parser refines this.

Often, a parser will wish to extract structure from sequences of P-expression Exprs.

A simple technique is repeated splitting of sequences; first by Semicolon, then by Comma, then by increasingly high binding-power operators.
More refined is to use a Pratt parser or similar (1, 2, 3) to build a parse tree using an extensible specification of the pre-, in-, and postfix operators involved.
Finally, if you treat sequences of Exprs as pre-lexed token streams, almost any parsing formalism (such as PEG parsing, Ometa, etc.) can be used to extract further syntactic structure.

Appendix: Equations for interpreting P-expressions as Preserves

The function uncomma(p) removes all occurrences of , from a P-expression p ∈ Expr − {,}.

uncomma : Expr − {`,`}	⟶	Expr
uncomma(`[`p …`]`)	=	`[`uncomma(p) …`]`	omitting any p = `,`
uncomma(`<`p …`>`)	=	`<`uncomma(p) …`>`	omitting any p = `,`
uncomma(`{`p …`}`)	=	`{`uncomma(p) …`}`	omitting any p = `,`
uncomma(`(`p …`)`)	=	`(`uncomma(p) …`)`	omitting any p = `,`
uncomma(`#{`p …`}`)	=	`#{`uncomma(p) …`}`	omitting any p = `,`
uncomma(`#:`p)	=	`#:`uncomma(p)
uncomma(`@`p q)	=	`@`uncomma(p) uncomma(q)
uncomma(p)	=	p	if p ∈ Atom ∪ Punct − {`,`}

We write ⌞uncomma(p)⌟ for the partial function mapping a P-expression p ∈ Expr − {,} to a corresponding Preserves Value.

⌞·⌟ : Expr − {`,`}	⇀	Value
⌞`[`p …`]`⌟	=	`[`⌞p⌟ …`]`
⌞`<`ℓ p …`>`⌟	=	`<`⌞ℓ⌟ ⌞p⌟ …`>`
⌞`{`k`:`v …`}`⌟	=	`{`⌞k⌟`:`⌞v⌟ …`}`	if all ⌞k⌟ … are distinct
⌞`#{`p …`}`⌟	=	`#{`⌞p⌟ …`}`	if all ⌞p⌟ … are distinct
⌞`#:`p⌟	=	`#:`⌞p⌟
⌞`@`p q⌟	=	`@`⌞p⌟ ⌞q⌟
⌞p⌟	=	p	when p ∈ Atom

Notes

Colon matching is greedy: when reading, all adjacent colons are always taken into a single token, and when writing, adjacent colon-sequence punctuation marks must be written with whitespace separating them. ↩
Implementation note. When implementing parsing of P-expressions into Preserves, consider offering an optional mode where trailing annotations Trailer are discarded instead of causing an error to be signalled. ↩