290 lines
7.8 KiB
Markdown
290 lines
7.8 KiB
Markdown
# A library for grammars
|
|
|
|
This is library to do interesting things with grammars. This was
|
|
originally built as a little toy for me to understand how LR parser
|
|
tables worked, but I discovered that what I *really* want is to be
|
|
able to leverage the grammar to do other things besides parsing.
|
|
|
|
The primary inspiration for this library is tree-sitter, which also
|
|
generates LR parsers for grammars written in a turing-complete
|
|
language. Like that, we write grammars in a language, only we do it in
|
|
Python instead of JavaScript.
|
|
|
|
## Making Grammars
|
|
|
|
To get started, create a grammar that derives from the `Grammar`
|
|
class. Create one method per non-terminal, decorated with the `rule`
|
|
decorator. Here's an example:
|
|
|
|
```python
|
|
from parser import *
|
|
|
|
@rule
|
|
def expression():
|
|
return seq(expression, PLUS, term) | term
|
|
|
|
@rule
|
|
def term():
|
|
return seq(LPAREN, expression, RPAREN) | ID
|
|
|
|
PLUS = Terminal('PLUS', '+')
|
|
LPAREN = Terminal('LPAREN', '(')
|
|
RPAREN = Terminal('RPAREN', ')')
|
|
ID = Terminal(
|
|
'ID',
|
|
Re.seq(
|
|
Re.set(("a", "z"), ("A", "Z"), "_"),
|
|
Re.set(("a", "z"), ("A", "Z"), ("0", "9"), "_").star(),
|
|
),
|
|
)
|
|
|
|
SimpleGrammar = Grammar(
|
|
name="Simple",
|
|
start=expression,
|
|
)
|
|
```
|
|
|
|
Terminals can be plain strings or regular expressions constructed with
|
|
the `Re` object. (Ironically, I guess this library is not clever
|
|
enough to parse a regular expression string into one of these
|
|
structures. If you want to build one, go nuts! It's just Python, you
|
|
can do whatever you want so long as the result is an `Re` object.)
|
|
|
|
Productions can be built out of terminals and non-terminals,
|
|
concatenated with the `seq` function or the `+` operator. Alternatives
|
|
can be expressed with the `alt` function or the `|` operator. These
|
|
things can be freely nested, as desired.
|
|
|
|
There are no helpers (yet!) for consuming lists, so they need to be
|
|
constructed in the classic context-free grammar way:
|
|
|
|
```python
|
|
@rule
|
|
def list():
|
|
return NUMBER | (list + COMMA + NUMBER)
|
|
|
|
NUMBER = Terminal(Re.set(("0", "9")).plus())
|
|
COMMA = Terminal(',')
|
|
|
|
NumberList = Grammar(
|
|
name="NumberList",
|
|
start=list,
|
|
)
|
|
```
|
|
|
|
(Unlike with PEGs, you can write grammars with left or right-recursion,
|
|
without restriction, either is fine.)
|
|
|
|
When used to generate a parser, the grammar describes a concrete
|
|
syntax tree. Unfortunately, that means that the list example above
|
|
will generate a very awkward tree for `1,2,3`:
|
|
|
|
```
|
|
list
|
|
list
|
|
list
|
|
NUMBER ("1")
|
|
COMMA
|
|
NUMBER ("2")
|
|
COMMA
|
|
NUMBER ("3")
|
|
```
|
|
|
|
In order to make this a little cleaner, rules can be "transparent",
|
|
which means they don't generate nodes in the tree and just dump their
|
|
contents into the parent node instead.
|
|
|
|
```python
|
|
@rule
|
|
def list():
|
|
# The starting rule can't be transparent: there has to be something to
|
|
# hold on to!
|
|
return transparent_list
|
|
|
|
@rule(transparent=True)
|
|
def transparent_list() -> Rule:
|
|
return NUMBER | (transparent_list + COMMA + NUMBER)
|
|
|
|
NUMBER = Terminal(Re.set(("0", "9")).plus())
|
|
COMMA = Terminal(',')
|
|
|
|
NumberList = Grammar(
|
|
name="NumberList",
|
|
start=list,
|
|
)
|
|
```
|
|
|
|
This grammar will generate the far more useful tree:
|
|
|
|
```
|
|
list
|
|
NUMBER ("1")
|
|
COMMA
|
|
NUMBER ("2")
|
|
COMMA
|
|
NUMBER ("3")
|
|
```
|
|
|
|
Rules that start with `_` are also interpreted as transparent,
|
|
following the lead set by tree-sitter, and so the grammar above is
|
|
probably better-written as:
|
|
|
|
```python
|
|
@rule
|
|
def list():
|
|
# The starting rule can't be transparent: there has to be something to
|
|
# hold on to!
|
|
return transparent_list
|
|
|
|
@rule
|
|
def _list() -> Rule:
|
|
return NUMBER | (_list + COMMA + NUMBER)
|
|
|
|
NUMBER = Terminal(Re.set(("0", "9")).plus())
|
|
COMMA = Terminal(',')
|
|
|
|
NumberList = Grammar(
|
|
name="NumberList",
|
|
start=list,
|
|
)
|
|
```
|
|
|
|
That will generate the same tree, but a little more succinctly.
|
|
|
|
Of course, it's a lot of work to write these transparent recursive
|
|
rules by hand all the time, so there are helpers that do it for you:
|
|
|
|
```python
|
|
@rule
|
|
def list():
|
|
return zero_or_more(NUMBER, COMMA) + NUMBER
|
|
|
|
NUMBER = Terminal(Re.set(("0", "9")).plus())
|
|
COMMA = Terminal(',')
|
|
|
|
NumberList = Grammar(
|
|
name="NumberList",
|
|
start=list,
|
|
)
|
|
```
|
|
|
|
Much better.
|
|
|
|
### Trivia
|
|
|
|
Most folks that want to parse something want to skip blanks when they
|
|
do it. Our grammars don't say anything about that by default (sorry),
|
|
so you probably want to be explicit about such things.
|
|
|
|
To allow (and ignore) spaces, newlines, tabs, and carriage-returns in
|
|
our number lists, we would modify the grammar as follows:
|
|
|
|
```python
|
|
@rule
|
|
def list():
|
|
return zero_or_more(NUMBER, COMMA) + NUMBER
|
|
|
|
NUMBER = Terminal(Re.set(("0", "9")).plus())
|
|
COMMA = Terminal(',')
|
|
|
|
BLANKS = Terminal(Re.set(" ", "\t", "\r", "\n").plus())
|
|
|
|
NumberList = Grammar(
|
|
name="NumberList",
|
|
start=list,
|
|
trivia=[BLANKS],
|
|
)
|
|
```
|
|
|
|
Now we can parse a list with spaces! "1 , 2, 3" will parse happily
|
|
into:
|
|
|
|
```
|
|
list
|
|
NUMBER ("1")
|
|
COMMA
|
|
NUMBER ("2")
|
|
COMMA
|
|
NUMBER ("3")
|
|
```
|
|
|
|
## Using Grammars
|
|
|
|
### Making Parsers and Parsing Text
|
|
|
|
Once you have a grammar you can make a parse table from it by
|
|
constructing an instance of the grammar and calling the `build_table`
|
|
method on it.
|
|
|
|
```python
|
|
grammar = NumberList()
|
|
parse_table = grammar.build_table()
|
|
lexer_table = grammar.compile_lexer()
|
|
```
|
|
|
|
In theory, in the future, you could pass the table to an output
|
|
generator and it would build a C source file or a Rust source file or
|
|
something to run the parse. Right now the only runtime is also written
|
|
in python, so you can do a parse as follows:
|
|
|
|
```
|
|
from parser import runtime
|
|
|
|
text = "1,2,3"
|
|
result, errors = runtime.parse(parse_table, lexer_table, "1,2,3")
|
|
```
|
|
|
|
`result` in the above example will be a concrete syntax tree, if the
|
|
parse was successful, and `errors` will be a list of error strings
|
|
from the parse. Note that the python runtime has automatic error
|
|
recovery (with a variant of
|
|
[CPCT+](https://tratt.net/laurie/blog/2020/automatic_syntax_error_recovery.html)),
|
|
so you may get a parse tree even if there were parse errors.
|
|
|
|
## Questions
|
|
|
|
### Why Python?
|
|
|
|
There are a few reasons to use python here.
|
|
|
|
First, Python 3 is widely pre-installed on MacOS and Linux. This
|
|
library requires nothing more than the basic standard library, and not
|
|
even a new version of it. Therefore, it turns out to be a pretty light
|
|
dependency for a rust or C++ or some other kind of project, where
|
|
you're using this to generate the parser tables but the parser itself
|
|
will be in some other language.
|
|
|
|
(Tree-sitter, on the other hand, requires its own standalone binary in
|
|
addition to node, which is a far less stable and available runtime in
|
|
2024.)
|
|
|
|
I also find the ergonomics of working in python a little nicer than
|
|
working in, say, JavaScript. Python gives me operator overloading for
|
|
things like `|` and `+`, which make the rules read a little closer to
|
|
EBNF for me. It gives me type annotations that work without running a
|
|
compiler over my input.
|
|
|
|
It also *actually raises errors* when I accidentally misspell the name
|
|
of a rule. And those errors come with the source location of exactly
|
|
where I made the spelling mistake!
|
|
|
|
Finally, I guess you could ask why I'm not using some DSL or something
|
|
like literally every other parser generator tool except for
|
|
tree-sitter. And the answer for that is: I just don't care to maintain
|
|
a parser for my parser generator. ("Yo dawg, I heard you liked
|
|
parsers...") Python gives me the ability to describe the data I want,
|
|
in an easy to leverage way, that comes with all the power and
|
|
flexibility of a general-purpose programming language. Turns out to be
|
|
pretty nice.
|
|
|
|
### What about grammars where blank space is significant, like ... well, python?
|
|
|
|
Right now there's no way to describe them natively.
|
|
|
|
You could write the grammar and introduce terminals like `INDENT` and
|
|
`DEDENT` but you would have to write a custom lexer to produce those
|
|
terminals, and probably handle them differently in all the other uses
|
|
of the grammar as well.
|
|
|
|
That limits the ability to write the grammar once and automatically
|
|
use it everywhere, but maybe it's good enough for you?
|