A cleaner API

I've been hacking on this in a different repository, so I thought I'd
bring it over here.
John Doty 2024-04-21 08:07:59 -07:00
parent f656dbd8f3
commit 7c1d9b5f2b
8 changed files with 2902 additions and 2853 deletions

134
README.md

@@ -1,18 +1,126 @@
# A collection of LR parser generators, from LR0 through LALR.
One day I read a tweet, asking for a tool which accepted a grammar and an
input file and which then produced simple parsed output, without any kind of
in-between. (There was other ranty stuff about how none of the existing tools
really worked, but that was beside the point.)
This is a small helper library to generate LR parser tables.
Upon reading the tweet, it occurred to me that I didn't know how LR parsers
worked and how they were generated, except in the broadest of terms. Thus, I
set about writing this, learning as I went.
The primary inspiration for this library is tree-sitter, which also generates
LR parsers for grammars written in a Turing-complete language. Like tree-sitter,
we write grammars in a general-purpose language, only in Python instead of
JavaScript.
This code is not written to be fast, or even efficient, although it runs its
test cases fast enough. It was instead written to be easy to follow along
with, so that when I forget how all this works I can come back to the code
and read along and learn all over again.
Why Python? Because Python 3 is widely pre-installed on macOS and Linux. This
library requires nothing more than the basic standard library, and not even a
new version of it. Therefore, it turns out to be a pretty light dependency for
a Rust or C++ or something kind of project. (Tree-sitter, on the other hand,
requires Node, which is a far less stable and available runtime in 2024.)
The parser tables can really be used to power anything. I prefer to make
concrete syntax trees (again, see tree-sitter), and there is no facility at all
for actions or custom ASTs or whatnot. Any such processing needs to be done by
the thing that processes the tables.
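
As a rough illustration of what "the thing that processes the tables" might look like, here is a minimal LR driver loop that builds a concrete syntax tree. The table layout (`PRODUCTIONS`, `ACTIONS`, `GOTOS`) is hypothetical and hand-built for a small expression grammar (`E -> E + T | T`, `T -> ( E ) | id`). It is not the format this library emits, and tokens are assumed to be plain strings.

```python
# Hypothetical, hand-built SLR tables for:
#   0: S' -> E    1: E -> E + T    2: E -> T    3: T -> ( E )    4: T -> id
# This layout is an assumption for illustration, not this library's output.
PRODUCTIONS = [("S'", 1), ("E", 3), ("E", 1), ("T", 3), ("T", 1)]


def _reduce_on_follow(production):
    # FOLLOW(E) and FOLLOW(T) are both {'+', ')', '$'} for this grammar.
    return {token: ("reduce", production) for token in ("+", ")", "$")}


ACTIONS = {
    0: {"(": ("shift", 3), "id": ("shift", 4)},
    1: {"+": ("shift", 5), "$": ("accept", None)},
    2: _reduce_on_follow(2),
    3: {"(": ("shift", 3), "id": ("shift", 4)},
    4: _reduce_on_follow(4),
    5: {"(": ("shift", 3), "id": ("shift", 4)},
    6: {"+": ("shift", 5), ")": ("shift", 8)},
    7: _reduce_on_follow(1),
    8: _reduce_on_follow(3),
}
GOTOS = {0: {"E": 1, "T": 2}, 3: {"E": 6, "T": 2}, 5: {"T": 7}}


def parse(tokens):
    """Run the LR driver loop and return a concrete syntax tree."""
    stack = [(0, None)]              # (state, subtree) pairs
    stream = list(tokens) + ["$"]    # '$' marks end of input
    pos = 0
    while True:
        state = stack[-1][0]
        token = stream[pos]
        action = ACTIONS[state].get(token)
        if action is None:
            raise SyntaxError(f"unexpected {token!r} at token {pos}")
        kind, arg = action
        if kind == "shift":
            stack.append((arg, token))
            pos += 1
        elif kind == "reduce":
            name, length = PRODUCTIONS[arg]
            children = [subtree for _, subtree in stack[-length:]]
            del stack[-length:]
            next_state = GOTOS[stack[-1][0]][name]
            stack.append((next_state, (name, children)))
        else:  # accept
            return stack[-1][1]


print(parse(["id", "+", "(", "id", ")"]))
```

Every reduce simply wraps the popped subtrees in a `(name, children)` tuple, so the result mirrors the grammar exactly. Anything fancier (actions, custom ASTs) would live in a driver like this rather than in the tables.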
## Making Grammars
To get started, create a grammar that derives from the `Grammar` class. Create
one method per nonterminal, decorated with the `rule` decorator. Here's an
example:

    PLUS = Token('+')
    LPAREN = Token('(')
    RPAREN = Token(')')
    ID = Token('id')

    class SimpleGrammar(Grammar):
        @rule
        def expression(self):
            return seq(self.expression, PLUS, self.term) | self.term

        @rule
        def term(self):
            return seq(LPAREN, self.expression, RPAREN) | ID

## Using Grammars
TODO
## Representation Choices
The SimpleGrammar class might seem a little verbose compared to a dense
structure like:

    grammar_simple = [
        ('E', ['E', '+', 'T']),
        ('E', ['T']),
        ('T', ['(', 'E', ')']),
        ('T', ['id']),
    ]

or

    grammar_simple = {
        'E': [
            ['E', '+', 'T'],
            ['T'],
        ],
        'T': [
            ['(', 'E', ')'],
            ['id'],
        ],
    }

The advantage that the class has over a table like this is that you get to have
all of your Python tools help you make sure your grammar is good, if you want
them. For example, if you're working with an LSP or something, the members give
you autocomplete and jump-to-definition and possibly even type-checking.
At the very least, if you mis-type the name of a nonterminal, or forget to
implement it, we will immediately raise an error that *INCLUDES THE LOCATION IN
THE SOURCE WHERE THE ERROR WAS MADE.* With tables, we can tell you that you
made a mistake but it's up to you to figure out where you did it.
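
To make the table-side limitation concrete, here is a small sketch of the kind of check a table-based tool can do. The helper `undefined_symbols` and the `grammar_typo` data are hypothetical, written only for this illustration. The check can name the bad symbol, but the plain strings carry no source location to report.

```python
def undefined_symbols(grammar, terminals):
    """Return names used in a production that are neither a rule nor a terminal."""
    defined = set(grammar) | set(terminals)
    missing = set()
    for alternatives in grammar.values():
        for production in alternatives:
            missing.update(sym for sym in production if sym not in defined)
    return sorted(missing)


grammar_typo = {
    'E': [['E', '+', 'T'], ['T']],
    'T': [['(', 'Expr', ')'], ['id']],  # 'Expr' should have been 'E'
}

print(undefined_symbols(grammar_typo, ['+', '(', ')', 'id']))  # prints ['Expr']
```

The output says *what* is wrong, but finding the offending line is left to the reader.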
### Aside: What about a custom DSL/EBNF-like thing?
Yeah, OK, there's a rich history of writing your grammar in a domain-specific
language. YACC did it, ANTLR does it, GRMTools... just about everybody except
Tree-Sitter does this.
But look, I've got several reasons for not doing it.
First, I'm lazy, and don't want to write yet another parser for my parser. What
tools should I use to write my parser generator parser? I guess I don't have my
parser generator parser yet, so probably a hand-written top down parser? Some
other Python parser generator? Ugh!
As an add-on to that, if I make my own format then I need to make tooling for
*that* too: syntax highlighters, jump to definition, the works. Yuck. An
existing language, and a format that builds on an existing language, gets me the
tooling that comes along with that language. If you can leverage that
effectively (and I think I have) then you start way ahead in terms of tooling.
Second, this whole thing is supposed to be easy to include in an existing
project, and adding a custom compiler doesn't seem to be that. Adding two Python
files seems to be about the right speed.
Third, and this is just hypothetical, it's probably pretty easy to write your
own tooling around a grammar if it's already in Python. If you want to make
railroad diagrams or EBNF pictures or whatever, all the productions are already
right there in data structures for you to process. I've tried to keep them
accessible and at least somewhat easy to work with. There's nothing that says a
DSL-based system *has* to produce unusable intermediate data (certainly there
are some tools that *try*), but with this approach the accessibility and the
ergonomics of the tool go hand in hand.
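
As a sketch of how little code such tooling might take, the following renders a grammar as BNF-style text. It works on the plain dictionary form shown under "Representation Choices" as a stand-in, since the library's own production objects are not described here. The `to_bnf` helper is hypothetical.

```python
def to_bnf(grammar):
    """Render a {nonterminal: [production, ...]} dict as BNF-style text."""
    lines = []
    for name, alternatives in grammar.items():
        rendered = []
        for production in alternatives:
            # Quote anything that is not itself a rule name, i.e. terminals.
            symbols = [sym if sym in grammar else f"'{sym}'" for sym in production]
            rendered.append(" ".join(symbols))
        lines.append(f"{name} ::= {' | '.join(rendered)}")
    return "\n".join(lines)


grammar_simple = {
    'E': [['E', '+', 'T'], ['T']],
    'T': [['(', 'E', ')'], ['id']],
}

print(to_bnf(grammar_simple))
# E ::= E '+' T | T
# T ::= '(' E ')' | 'id'
```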
## Some History
The first version of this code was written as an idle exercise to learn how LR
parser table generation even worked. It was... very simple, fairly easy to
follow, and just *incredibly* slow. Like, mind-bogglingly slow. Unusably slow
for anything but the most trivial grammar.
As a result, when I decided I wanted to use it for a larger grammar, I found that
I just couldn't. So this has been hacked and significantly improved from that
version, now capable of building tables for nontrivial grammars. It could still
be a lot faster, but it meets my needs for now.
(BTW, the notes I read to learn how all this works are at
http://dragonbook.stanford.edu/lecture-notes/Stanford-CS143/. Specifically,
@@ -20,7 +128,5 @@ I started with handout 8, 'Bottom-up-parsing', and went from there. (I did
eventually have to backtrack a little into handout 7, since that's where
First() and Follow() are covered.)
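
For readers who have not met First() before, here is a minimal fixed-point sketch of it over the dictionary grammar form used earlier. This is the textbook computation, not necessarily how this library implements it. The names `first_sets` and `EPSILON` are chosen for this example.

```python
EPSILON = ""  # stand-in marker for "this nonterminal can derive nothing"


def first_sets(grammar):
    """Compute FIRST for every nonterminal by iterating until nothing changes."""
    first = {name: set() for name in grammar}
    changed = True
    while changed:
        changed = False
        for name, alternatives in grammar.items():
            for production in alternatives:
                before = len(first[name])
                for symbol in production:
                    if symbol not in grammar:  # terminal: it starts the production
                        first[name].add(symbol)
                        break
                    first[name] |= first[symbol] - {EPSILON}
                    if EPSILON not in first[symbol]:
                        break  # this nonterminal cannot vanish, so stop here
                else:
                    # Every symbol could vanish, or the production is empty.
                    first[name].add(EPSILON)
                if len(first[name]) != before:
                    changed = True
    return first


grammar_simple = {
    'E': [['E', '+', 'T'], ['T']],
    'T': [['(', 'E', ')'], ['id']],
}

print(first_sets(grammar_simple))
```

For this grammar both `E` and `T` end up with the set containing `(` and `id`, since every expression ultimately starts with a term.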
Enjoy!
doty
2016-12-09
May 2024