[readme] Rewrite the readme and add a helper
The helper is nice actually.
parent
b1f4c56f49
commit
071cd29d8f
3 changed files with 242 additions and 113 deletions
312
README.md
|
|
@ -1,30 +1,25 @@
|
|||
# A collection of LR parser generators, from LR0 through LALR.
|
||||
# A library for grammars
|
||||
|
||||
This is a small helper library to generate LR parser tables.
|
||||
This is a library to do interesting things with grammars. This was
|
||||
originally built as a little toy for me to understand how LR parser
|
||||
tables worked, but I discovered that what I *really* want is to be
|
||||
able to leverage the grammar to do other things besides parsing.
|
||||
|
||||
The primary inspiration for this library is tree-sitter, which also generates
|
||||
LR parsers for grammars written in a Turing-complete language. Like that, we
|
||||
write grammars in a language, only we do it in Python instead of JavaScript.
|
||||
|
||||
Why Python? Because Python 3 is widely pre-installed on macOS and Linux. This
|
||||
library requires nothing more than the basic standard library, and not even a
|
||||
new version of it. Therefore, it turns out to be a pretty light dependency for
|
||||
a Rust or C++ or similar kind of project. (Tree-sitter, on the other hand,
|
||||
requires node, which is a far less stable and available runtime in 2024.)
|
||||
|
||||
The parser tables can really be used to power anything. I prefer to make
|
||||
concrete syntax trees (again, see tree-sitter), and there is no facility at all
|
||||
for actions or custom ASTs or whatnot. Any such processing needs to be done by
|
||||
the thing that processes the tables.
|
||||
The primary inspiration for this library is tree-sitter, which also
|
||||
generates LR parsers for grammars written in a Turing-complete
|
||||
language. Like tree-sitter, we write grammars in a programming language, only we do it in
|
||||
Python instead of JavaScript.
|
||||
|
||||
## Making Grammars
|
||||
|
||||
To get started, create a grammar that derives from the `Grammar` class. Create
|
||||
one method per nonterminal, decorated with the `rule` decorator. Here's an
|
||||
example:
|
||||
|
||||
To get started, create a grammar that derives from the `Grammar`
|
||||
class. Create one method per non-terminal, decorated with the `rule`
|
||||
decorator. Here's an example:
|
||||
|
||||
```python
class SimpleGrammar(Grammar):
    start = "expression"

    @rule
    def expression(self):
        return seq(self.expression, self.PLUS, self.term) | self.term
|
||||
|
|
@ -36,98 +31,231 @@ example:
|
|||
PLUS = Terminal('+')
|
||||
LPAREN = Terminal('(')
|
||||
RPAREN = Terminal(')')
|
||||
ID = Terminal('id')
|
||||
ID = Terminal(
|
||||
Re.seq(
|
||||
Re.set(("a", "z"), ("A", "Z"), "_"),
|
||||
Re.set(("a", "z"), ("A", "Z"), ("0", "9"), "_").star(),
|
||||
),
|
||||
)
|
||||
```
|
||||
|
||||
Terminals can be plain strings or regular expressions constructed with
|
||||
the `Re` object. (Ironically, I guess this library is not clever
|
||||
enough to parse a regular expression string into one of these
|
||||
structures. If you want to build one, go nuts! It's just Python, you
|
||||
can do whatever you want so long as the result is an `Re` object.)
|
||||
|
||||
## Using grammars
|
||||
Productions can be built out of terminals and non-terminals,
|
||||
concatenated with the `seq` function or the `+` operator. Alternatives
|
||||
can be expressed with the `alt` function or the `|` operator. These
|
||||
things can be freely nested, as desired.
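
To make the nesting concrete, here is a minimal, self-contained sketch of how `+` and `|` can build up rule structures through operator overloading. It does not use this library: `Node`, `Sym`, `Seq`, `Alt`, and `flatten` are hypothetical names invented for illustration, and the real classes may look quite different.

```python
from dataclasses import dataclass
from typing import List


class Node:
    """Hypothetical base class giving every rule fragment `+` and `|`."""

    def __add__(self, other: "Node") -> "Seq":
        return Seq(self, other)

    def __or__(self, other: "Node") -> "Alt":
        return Alt(self, other)


@dataclass(frozen=True)
class Sym(Node):
    name: str


@dataclass(frozen=True)
class Seq(Node):
    left: Node
    right: Node


@dataclass(frozen=True)
class Alt(Node):
    left: Node
    right: Node


def flatten(node: Node) -> List[List[str]]:
    """Expand a nested rule into a list of flat productions."""
    if isinstance(node, Sym):
        return [[node.name]]
    if isinstance(node, Alt):
        # Either side may match, so the productions are simply pooled.
        return flatten(node.left) + flatten(node.right)
    if isinstance(node, Seq):
        # Every left production can be followed by every right production.
        return [l + r for l in flatten(node.left) for r in flatten(node.right)]
    raise TypeError(f"not a rule fragment: {node!r}")


expression, term, PLUS = Sym("expression"), Sym("term"), Sym("PLUS")

# `+` binds tighter than `|` in Python, so no parentheses are needed here.
productions = flatten(expression + PLUS + term | term)

# Alternatives nested inside a sequence are distributed over it.
nested = flatten((expression | term) + PLUS)
```

Because Python's own precedence puts `+` above `|`, rules written this way group the same way they would in EBNF.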
|
||||
|
||||
TODO
|
||||
There are no helpers (yet!) for consuming lists, so they need to be
|
||||
constructed in the classic context-free grammar way:
|
||||
|
||||
## Representation Choices
|
||||
```python
|
||||
class NumberList(Grammar):
|
||||
start = "list"
|
||||
|
||||
The SimpleGrammar class might seem a little verbose compared to a dense
|
||||
structure like:
|
||||
@rule
|
||||
def list(self):
|
||||
return self.NUMBER | (self.list + self.COMMA + self.NUMBER)
|
||||
|
||||
grammar_simple = [
|
||||
('E', ['E', '+', 'T']),
|
||||
('E', ['T']),
|
||||
('T', ['(', 'E', ')']),
|
||||
('T', ['id']),
|
||||
]
|
||||
NUMBER = Terminal(Re.set(("0", "9")).plus())
|
||||
COMMA = Terminal(',')
|
||||
```
|
||||
|
||||
or
|
||||
(Unlike with PEGs, you can write grammars with left- or right-recursion
without restriction; either is fine.)
|
||||
|
||||
grammar_simple = {
|
||||
'E': [
|
||||
['E', '+', 'T'],
|
||||
['T'],
|
||||
],
|
||||
'T': [
|
||||
['(', 'E', ')'],
|
||||
['id'],
|
||||
],
|
||||
}
|
||||
When used to generate a parser, the grammar describes a concrete
|
||||
syntax tree. Unfortunately, that means that the list example above
|
||||
will generate a very awkward tree for `1,2,3`:
|
||||
|
||||
```
list
  list
    list
      NUMBER ("1")
    COMMA
    NUMBER ("2")
  COMMA
  NUMBER ("3")
```
|
||||
|
||||
The advantage that the class has over a table like this is that you get to have
|
||||
all of your Python tools help you make sure your grammar is good, if you want
|
||||
them. e.g., if you're working with an LSP or something, the members give you
|
||||
autocomplete and jump-to-definition and possibly even type-checking.
|
||||
In order to make this a little cleaner, rules can be "transparent",
|
||||
which means they don't generate nodes in the tree and just dump their
|
||||
contents into the parent node instead.
|
||||
|
||||
At the very least, if you mis-type the name of a nonterminal, or forget to
|
||||
implement it, we will immediately raise an error that *INCLUDES THE LOCATION IN
|
||||
THE SOURCE WHERE THE ERROR WAS MADE.* With tables, we can tell you that you
|
||||
made a mistake but it's up to you to figure out where you did it.
|
||||
```python
|
||||
class NumberList(Grammar):
|
||||
start = "list"
|
||||
|
||||
### Aside: What about a custom DSL/EBNF like thing?
|
||||
@rule
|
||||
def list(self):
|
||||
# The starting rule can't be transparent: there has to be something to
|
||||
# hold on to!
|
||||
return self.transparent_list
|
||||
|
||||
Yeah, OK, there's a rich history of writing your grammar in a domain-specific
|
||||
language. YACC did it, ANTLR does it, GRMTools.... just about everybody except
|
||||
Tree-Sitter does this.
|
||||
@rule(transparent=True)
|
||||
def transparent_list(self) -> Rule:
|
||||
return self.NUMBER | (self.transparent_list + self.COMMA + self.NUMBER)
|
||||
|
||||
But look, I've got several reasons for not doing it.
|
||||
NUMBER = Terminal(Re.set(("0", "9")).plus())
|
||||
COMMA = Terminal(',')
|
||||
```
|
||||
|
||||
First, I'm lazy, and don't want to write yet another parser for my parser. What
|
||||
tools should I use to write my parser generator parser? I guess I don't have my
|
||||
parser generator parser yet, so probably a hand-written top down parser? Some
|
||||
other python parser generator? Ugh!
|
||||
This grammar will generate the far more useful tree:
|
||||
|
||||
As an add-on to that, if I make my own format then I need to make tooling for
|
||||
*that* too: syntax highlighters, jump to definition, the works. Yuck. An
|
||||
existing language, and a format that builds on an existing language, gets me the
|
||||
tooling that comes along with that language. If you can leverage that
|
||||
effectively (and I think I have) then you start way ahead in terms of tooling.
|
||||
```
list
  NUMBER ("1")
  COMMA
  NUMBER ("2")
  COMMA
  NUMBER ("3")
```
|
||||
|
||||
Second, this whole thing is supposed to be easy to include in an existing
|
||||
project, and adding a custom compiler doesn't seem to be that. Adding two python
|
||||
files seems to be about the right speed.
|
||||
Rules that start with `_` are also interpreted as transparent,
|
||||
following the lead set by tree-sitter, and so the grammar above is
|
||||
probably better-written as:
|
||||
|
||||
Thirdly, and this is just hypothetical, it's probably pretty easy to write your
|
||||
own tooling around a grammar if it's already in Python. If you want to make
|
||||
railroad diagrams or EBNF pictures or whatever, all the productions are already
|
||||
right there in data structures for you to process. I've tried to keep them
|
||||
accessible and at least somewhat easy to work with. There's nothing that says a
|
||||
DSL-based system *has* to produce unusable intermediate data- certainly there
|
||||
are some tools that *try*- but with this approach the accessibility and the
|
||||
ergonomics of the tool go hand in hand.
|
||||
```python
|
||||
class NumberList(Grammar):
|
||||
start = "list"
|
||||
|
||||
## Some History
|
||||
@rule
|
||||
def list(self):
|
||||
return self._list
|
||||
|
||||
The first version of this code was written as an idle exercise to learn how LR
|
||||
parser table generation even worked. It was... very simple, fairly easy to
|
||||
follow, and just *incredibly* slow. Like, mind-bogglingly slow. Unusably slow
|
||||
for anything but the most trivial grammar.
|
||||
@rule
|
||||
def _list(self):
|
||||
return self.NUMBER | (self._list + self.COMMA + self.NUMBER)
|
||||
|
||||
As a result, when I decided I wanted to use it for a larger grammar, I found that
|
||||
I just couldn't. So this has been hacked and significantly improved from that
|
||||
version, now capable of building tables for nontrivial grammars. It could still
|
||||
be a lot faster, but it meets my needs for now.
|
||||
NUMBER = Terminal(Re.set(("0", "9")).plus())
|
||||
COMMA = Terminal(',')
|
||||
```
|
||||
|
||||
(BTW, the notes I read to learn how all this works are at
|
||||
http://dragonbook.stanford.edu/lecture-notes/Stanford-CS143/. Specifically,
|
||||
I started with handout 8, 'Bottom-up-parsing', and went from there. (I did
|
||||
eventually have to backtrack a little into handout 7, since that's where
|
||||
First() and Follow() are covered.))
|
||||
That will generate the same tree, but a little more succinctly.
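
As a rough illustration of what "transparent" means for the tree, the sketch below dissolves transparent nodes into their parents after the fact. It is not how this library does it: `Token`, `Tree`, `splice`, and `collapse` are hypothetical names, and only the leading-underscore convention is modelled.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Token:
    kind: str
    text: str


@dataclass
class Tree:
    name: str
    children: List[Union["Tree", Token]] = field(default_factory=list)


def is_transparent(name: str) -> bool:
    # Assumption: only the `_` prefix marks a rule as transparent here.
    return name.startswith("_")


def splice(node):
    """Return the nodes that should stand in for `node` in its parent."""
    if isinstance(node, Token):
        return [node]
    children = [n for child in node.children for n in splice(child)]
    if is_transparent(node.name):
        return children  # dissolve: hand the children up to the parent
    return [Tree(node.name, children)]


def collapse(root: Tree) -> Tree:
    """Keep the root itself; the starting rule has to hold on to something."""
    return Tree(root.name, [n for child in root.children for n in splice(child)])


# The tree a left-recursive `_list` rule would build for "1,2,3"
# if nothing were transparent.
raw = Tree("list", [
    Tree("_list", [
        Tree("_list", [
            Tree("_list", [Token("NUMBER", "1")]),
            Token("COMMA", ","),
            Token("NUMBER", "2"),
        ]),
        Token("COMMA", ","),
        Token("NUMBER", "3"),
    ]),
])

flat = collapse(raw)
```

After `collapse`, `flat` is a single `list` node whose children are the five tokens in order.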
|
||||
|
||||
doty
|
||||
May 2024
|
||||
### Trivia
|
||||
|
||||
Most folks that want to parse something want to skip blanks when they
|
||||
do it. Our grammars don't say anything about that by default (sorry),
|
||||
so you probably want to be explicit about such things.
|
||||
|
||||
To allow (and ignore) spaces, newlines, tabs, and carriage-returns in
|
||||
our number lists, we would modify the grammar as follows:
|
||||
|
||||
```python
class NumberList(Grammar):
    start = "list"
    trivia = ["BLANKS"]  # <- Add a `trivia` member

    @rule
    def list(self):
        return self._list

    @rule
    def _list(self):
        return self.NUMBER | (self._list + self.COMMA + self.NUMBER)

    NUMBER = Terminal(Re.set(("0", "9")).plus())
    COMMA = Terminal(',')

    BLANKS = Terminal(Re.set(" ", "\t", "\r", "\n").plus())
    # ^ and add a new terminal to describe it
```
|
||||
|
||||
Now we can parse a list with spaces! "1 , 2, 3" will parse happily
|
||||
into:
|
||||
|
||||
```
list
  NUMBER ("1")
  COMMA
  NUMBER ("2")
  COMMA
  NUMBER ("3")
```
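
To show the idea of trivia in isolation, here is a small standalone tokenizer built on the standard `re` module that recognizes blanks as a token and then drops them before the parser would see them. It is only a sketch of the concept: `TOKEN_RE`, `TRIVIA`, and `tokenize` are made-up names, and this library's own lexer tables work differently.

```python
import re

# One named group per terminal, mirroring NUMBER, COMMA, and BLANKS above.
TOKEN_RE = re.compile(r"(?P<NUMBER>[0-9]+)|(?P<COMMA>,)|(?P<BLANKS>[ \t\r\n]+)")

# Terminals listed here are matched but never handed to the parser.
TRIVIA = {"BLANKS"}


def tokenize(text):
    """Return (kind, text) pairs for `text`, skipping trivia."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup not in TRIVIA:
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("1 , 2, 3")
```

The blanks still have to be described by a terminal so the lexer can recognize them; being trivia only means they are discarded once matched.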
|
||||
|
||||
## Using Grammars
|
||||
|
||||
### Making Parsers and Parsing Text
|
||||
|
||||
Once you have a grammar you can make a parse table from it by
|
||||
constructing an instance of the grammar and calling the `build_table`
|
||||
method on it.
|
||||
|
||||
```python
|
||||
grammar = NumberList()
|
||||
parse_table = grammar.build_table()
|
||||
lexer_table = grammar.compile_lexer()
|
||||
```
|
||||
|
||||
In theory, in the future, you could pass the table to an output
|
||||
generator and it would build a C source file or a Rust source file or
|
||||
something to run the parse. Right now the only runtime is also written
|
||||
in python, so you can do a parse as follows:
|
||||
|
||||
```python
|
||||
from parser import runtime
|
||||
|
||||
text = "1,2,3"
|
||||
result, errors = runtime.parse(parse_table, lexer_table, text)
|
||||
```
|
||||
|
||||
`result` in the above example will be a concrete syntax tree, if the
|
||||
parse was successful, and `errors` will be a list of error strings
|
||||
from the parse. Note that the python runtime has automatic error
|
||||
recovery (with a variant of
|
||||
[CPCT+](https://tratt.net/laurie/blog/2020/automatic_syntax_error_recovery.html)),
|
||||
so you may get a parse tree even if there were parse errors.
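
Since a tree can come back alongside errors, callers that only want clean parses need to check the error list themselves. The wrapper below is a hedged sketch of one way to do that: `ParseFailed`, `parse_strict`, and `fake_parse` are hypothetical names, and `fake_parse` is a stand-in that merely imitates the `(result, errors)` shape described above.

```python
from typing import Any, Callable, List, Optional, Tuple

# Any callable that takes text and returns (tree, errors).
ParseFn = Callable[[str], Tuple[Optional[Any], List[str]]]


class ParseFailed(Exception):
    """Raised when a parse reports errors, even if a tree was recovered."""

    def __init__(self, errors: List[str], partial_tree: Optional[Any]):
        super().__init__("; ".join(errors))
        self.errors = errors
        self.partial_tree = partial_tree  # kept around for diagnostics


def parse_strict(parse: ParseFn, text: str) -> Any:
    """Return the tree only if the parse reported no errors."""
    tree, errors = parse(text)
    if errors:
        raise ParseFailed(errors, tree)
    return tree


def fake_parse(text: str):
    # Stand-in for a real parser: "recovers" from a trailing comma.
    if text.endswith(","):
        return ("list", text.rstrip(",")), ["unexpected end of input after ','"]
    return ("list", text), []


ok = parse_strict(fake_parse, "1,2,3")

try:
    parse_strict(fake_parse, "1,2,")
    recovered = None
except ParseFailed as failure:
    recovered = failure.partial_tree
```

With the runtime shown above, the callable passed in could be something like `lambda t: runtime.parse(parse_table, lexer_table, t)`.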
|
||||
|
||||
## Questions
|
||||
|
||||
### Why Python?
|
||||
|
||||
There are a few reasons to use python here.
|
||||
|
||||
First, Python 3 is widely pre-installed on macOS and Linux. This
|
||||
library requires nothing more than the basic standard library, and not
|
||||
even a new version of it. Therefore, it turns out to be a pretty light
|
||||
dependency for a Rust or C++ or some other kind of project, where
|
||||
you're using this to generate the parser tables but the parser itself
|
||||
will be in some other language.
|
||||
|
||||
(Tree-sitter, on the other hand, requires its own standalone binary in
|
||||
addition to node, which is a far less stable and available runtime in
|
||||
2024.)
|
||||
|
||||
I also find the ergonomics of working in python a little nicer than
|
||||
working in, say, JavaScript. Python gives me operator overloading for
|
||||
things like `|` and `+`, which make the rules read a little closer to
|
||||
EBNF for me. It gives me type annotations that work without running a
|
||||
compiler over my input.
|
||||
|
||||
It also *actually raises errors* when I accidentally misspell the name
|
||||
of a rule. And those errors come with the source location of exactly
|
||||
where I made the spelling mistake!
|
||||
|
||||
Finally, I guess you could ask why I'm not using some DSL or something
|
||||
like literally every other parser generator tool except for
|
||||
tree-sitter. And the answer for that is: I just don't care to maintain
|
||||
a parser for my parser generator. ("Yo dawg, I heard you liked
|
||||
parsers...") Python gives me the ability to describe the data I want,
|
||||
in an easy to leverage way, that comes with all the power and
|
||||
flexibility of a general-purpose programming language. Turns out to be
|
||||
pretty nice.
|
||||
|
||||
### What about grammars where blank space is significant, like ... well, python?
|
||||
|
||||
Right now there's no way to describe them natively.
|
||||
|
||||
You could write the grammar and introduce terminals like `INDENT` and
|
||||
`DEDENT` but you would have to write a custom lexer to produce those
|
||||
terminals, and probably handle them differently in all the other uses
|
||||
of the grammar as well.
|
||||
|
||||
That limits the ability to write the grammar once and automatically
|
||||
use it everywhere, but maybe it's good enough for you?
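
For a sense of what such a custom lexer involves, here is a minimal sketch that synthesizes `INDENT` and `DEDENT` tokens from leading spaces using an indentation stack. It is independent of this library: `indent_tokens` and the `LINE` token kind are invented for illustration, only spaces are handled (no tabs), and a real lexer would tokenize the rest of each line rather than emit it whole.

```python
def indent_tokens(source):
    """Yield (kind, text) pairs, adding INDENT/DEDENT around nested lines."""
    stack = [0]  # indentation widths of the currently open blocks
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:
            continue  # blank lines never change the indentation level
        width = len(line) - len(stripped)
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", "")
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT", "")
        if width != stack[-1]:
            # Dedented to a column that no enclosing block started at.
            raise ValueError(f"inconsistent dedent in line {line!r}")
        yield ("LINE", stripped)
    # Close any blocks still open at end of input.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", "")


kinds = [kind for kind, _ in indent_tokens("a\n  b\n    c\n  d\ne\n")]
```

The grammar itself could then treat `INDENT` and `DEDENT` like any other terminals; the hard part is that every other consumer of the grammar has to agree on how they are produced.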
|
||||