P-expressions
Tony Garnock-Jones tonyg@leastfixedpoint.com
May 2024. Version 0.3.2.
Experimental. This document defines a grammar called Preserves Expressions (P-expressions, pexprs) that includes ordinary Preserves text syntax but offers extensions sufficient to support a Lisp- or Haskell-like programming notation.
Motivation. The text syntax for Preserves works well for writing
Values, i.e. data. However, in some contexts, Preserves applications
need a broader grammar that allows interleaving of expressions with
data. Two examples are the Preserves Schema
language and the Synit configuration scripting
language, both of
which (ab)use Preserves text syntax as a kind of programming notation.
Preliminaries
The P-expression grammar includes by reference the definition of Atom from the
text syntax, as well as the definitions that Atom depends on.
ws = *(%x20 / %x09 / CR / LF)
No changes to the Preserves semantic model are made. Every Preserves text-syntax term can be parsed as a valid P-expression, but in general P-expressions must be rewritten or otherwise interpreted before a meaningful Preserves value can be arrived at (see below).
Grammar
Standalone documents containing P-expressions are sequences of
individual Exprs, followed by annotations, comments, and/or whitespace.
Document = *Expr Trailer ws
A single P-expression Expr can be an Atom from the text syntax,
a compound expression, special punctuation, an Embedded expression, or
an Annotated expression. The class SimpleExpr includes all of Expr
except special punctuation.
Expr = ws (SimpleExpr / Punct)
SimpleExpr = Atom / Compound / Embedded / Annotated
Embedded and annotated values are as in the text syntax, differing only
in that uses of Value are replaced with SimpleExpr.
Embedded = "#:" SimpleExpr
Annotated = Annotation SimpleExpr
Annotation = "@" SimpleExpr / "#" [(%x20 / %x09/ %x21) linecomment] (CR / LF)
linecomment = *<any unicode scalar value except CR or LF>
P-expression special punctuation marks are comma, semicolon, and sequences of one or more colons.1
Punct = "," / ";" / 1*":"
Compound expressions are sequences of Exprs with optional trailing
Annotations, surrounded by various kinds of parentheses.
Compound = Sequence / Record / Block / Group / Set
Sequence = "[" *Expr Trailer ws "]"
Record = "<" *Expr Trailer ws ">"
Block = "{" *Expr Trailer ws "}"
Group = "(" *Expr Trailer ws ")"
Set = "#{" *Expr Trailer ws "}"
In an Annotated P-expression, annotations and comments attach to the
term following them, just as in the ordinary text syntax. However, it is
common in programming notations to allow comments at the end of a file
or other sequential construct. The ordinary text syntax forbids comments
in these positions, but P-expressions allow them.
Trailer = *(ws Annotation)
Encoding P-expressions as Preserves
We write ⌜p⌝ for the encoding into Preserves of a P-expression p.
| ⌜·⌝ : Expr | ⟶ | Value | |
⌜[p …]⌝ |
= | [⌜p⌝ …] |
|
⌜<p …>⌝ |
= | <r ⌜p⌝ …> |
|
⌜{p …}⌝ |
= | <b ⌜p⌝ …> |
|
⌜(p …)⌝ |
= | <g ⌜p⌝ …> |
|
⌜#{p …}⌝ |
= | <s ⌜p⌝ …> |
|
⌜#:p⌝ |
= | #:⌜p⌝ |
|
⌜@p q⌝ |
= | @⌜p⌝ ⌜q⌝ |
|
| ⌜p⌝ | = | p | when p ∈ Atom |
⌜,⌝ |
= | <p ','> |
|
⌜;⌝ |
= | <p ';'> |
|
⌜: …⌝ |
= | <p ': …'> |
|
| ⌜t⌝ | = | @⌜a⌝ … <a> |
where @a … are the annotations in t and t ∈ Trailer |
The record <a> acts as an anchor for the annotations in a Trailer.
We overload the ⌜·⌝ notation for encoding whole Documents into
sequences of Preserves values.
| ⌜·⌝ : P-expression Document | ⟶ | Preserves Sequence |
| ⌜p …⌝ | = | [⌜p⌝ …] |
⌜p … @a …⌝ |
= | [⌜p⌝ … @⌜a⌝ … <a>] |
where @a … are trailing annotations |
Interpreting P-expressions as Preserves
The previous section discussed ways of representing P-expressions using Preserves. Here, we discuss interpreting P-expressions as Preserves so that (1) a Preserves datum (2) written using Preserves text syntax and then (3) read as a P-expression can be (4) interpreted from that P-expression to yield the original datum.
- Every
(..)or;that appears is an error. - Every
:,::,:::, … is an error, except in context ofBlocks as described below. - Every
,that appears is discarded. - Every
Trailerthat appears is an error.2 - Every
Recordwith no values in it is an error. - Every
Blockmust contain zero or more repeating triplets ofSimpleExpr,:,SimpleExpr. AnyBlocknot following this pattern is an error. EachBlockfollowing the pattern is translated to aDictionarycontaining a key/value pair for each triplet. AnyBlockwith duplicate keys (under interpretation) is an error. - Every
Setcontaining any duplicate expressions (under interpretation) is an error.
Appendix: Examples
Examples are given as pairs of P-expressions and their Preserves text-syntax encodings.
Individual P-expression Exprs
⌜<date 1821 (lookup-month "February") 3>⌝
= <r date 1821 <g lookup-month "February"> 3>
⌜<>⌝
= <r>
⌜(begin (println! (+ 1 2)) (+ 3 4))⌝
= <g begin <g println! <g + 1 2>> <g + 3 4>>
⌜()⌝
= <g>
⌜[() () ()]⌝
= [<g>, <g>, <g>]
⌜{
setUp();
# Now enter the loop
loop: {
greet("World");
}
tearDown();
}⌝
= <b
setUp <g> <p ';'>
# Now enter the loop
loop <p ':'> <b
greet <g "World"> <p ';'>
>
tearDown <g> <p ';'>
>
⌜[1 + 2.0, print "Hello", predicate: #t, foo, #:remote, bar]⌝
= [1 + 2.0 <p ','> print "Hello" <p ','> predicate <p ':'> #t <p ','>
foo <p ','> #:remote <p ','> bar]
⌜#{1 2 3}⌝
= <s 1 2 3>
⌜#{(read) (read) (read)}⌝
= <s <g read> <g read> <g read>>
⌜{
optional name: string,
address: Address,
}⌝
= <b
optional name <p ':'> string <p ','>
address <p ':'> Address <p ','>
>
Whole Documents
⌜#!/example/of/a/shebang
{
key: value
# example of a comment at the end of a dictionary
}
# example of a comment at the end of the input file⌝
= @<r interpreter "/example/of/a/shebang">
[ <b
key <p ':'> value
@"example of a comment at the end of a dictionary" <a>
>
@"example of a comment at the end of the input file"
<a>
]
Appendix: Reading vs. Parsing
Lisp systems first read streams of bytes into S-expressions and then parse those S-expressions into more abstract structures denoting various kinds of program syntax. Separation of reading from parsing is what gives Lisp its syntactic flexibility.
Similarly, the Apple programming language Dylan included a reader-parser split, with the Dylan reader producing D-expressions that are somewhat similar to P-expressions.
Finally, the Racket dialects Honu and Something use a reader-parser-macro setup, where the reader produces Racket data, the parser produces “syntax” and is user-extensible, and Racket’s own modular macro system rewrites this “syntax” down to core forms to be compiled to machine code.
Similarly, when using P-expressions as the foundation for a language, a generic P-expression reader can then feed into special-purpose parsers. The reader captures the coarse syntactic structure of a program, and the parser refines this.
Often, a parser will wish to extract structure from sequences of
P-expression Exprs.
-
A simple technique is repeated splitting of sequences; first by
Semicolon, then byComma, then by increasingly high binding-power operators. -
More refined is to use a Pratt parser or similar (1, 2, 3) to build a parse tree using an extensible specification of the pre-, in-, and postfix operators involved.
-
Finally, if you treat sequences of
Exprs as pre-lexed token streams, almost any parsing formalism (such as PEG parsing, Ometa, etc.) can be used to extract further syntactic structure.
Appendix: Equations for interpreting P-expressions as Preserves
The function uncomma(p) removes all occurrences of , from a
P-expression p ∈ Expr − {,}.
uncomma : Expr − {,} |
⟶ | Expr | |
uncomma([p …]) |
= | [uncomma(p) …] |
omitting any p = , |
uncomma(<p …>) |
= | <uncomma(p) …> |
omitting any p = , |
uncomma({p …}) |
= | {uncomma(p) …} |
omitting any p = , |
uncomma((p …)) |
= | (uncomma(p) …) |
omitting any p = , |
uncomma(#{p …}) |
= | #{uncomma(p) …} |
omitting any p = , |
uncomma(#:p) |
= | #:uncomma(p) |
|
uncomma(@p q) |
= | @uncomma(p) uncomma(q) |
|
| uncomma(p) | = | p | if p ∈ Atom ∪ Punct − {,} |
We write ⌞uncomma(p)⌟ for the partial function mapping a
P-expression p ∈ Expr − {,} to a corresponding Preserves Value.
⌞·⌟ : Expr − {,} |
⇀ | Value | |
⌞[p …]⌟ |
= | [⌞p⌟ …] |
|
⌞<ℓ p …>⌟ |
= | <⌞ℓ⌟ ⌞p⌟ …> |
|
⌞{k:v …}⌟ |
= | {⌞k⌟:⌞v⌟ …} |
if all ⌞k⌟ … are distinct |
⌞#{p …}⌟ |
= | #{⌞p⌟ …} |
if all ⌞p⌟ … are distinct |
⌞#:p⌟ |
= | #:⌞p⌟ |
|
⌞@p q⌟ |
= | @⌞p⌟ ⌞q⌟ |
|
| ⌞p⌟ | = | p | when p ∈ Atom |
Notes
-
Colon matching is greedy: when reading, all adjacent colons are always taken into a single token, and when writing, adjacent colon-sequence punctuation marks must be written with whitespace separating them. ↩
-
Implementation note. When implementing parsing of P-expressions into Preserves, consider offering an optional mode where trailing annotations
Trailerare discarded instead of causing an error to be signalled. ↩