A cleaner API

I've been hacking on this in a different repository, so I thought I'd bring it over here.
2024-04-21 08:07:59 -07:00
commit 7c1d9b5f2b
parent f656dbd8f3
8 changed files with 2902 additions and 2853 deletions
--- a/README.md
+++ b/README.md
@@ -1,18 +1,126 @@
 # A collection of LR parser generators, from LR0 through LALR.

-One day I read a tweet, asking for a tool which accepted a grammar and an
-input file and which then produced simple parsed output, without any kind of
-in-between. (There was other ranty stuff about how none of the existing tools
-really worked, but that was beside the point.)
+This is a small helper library to generate LR parser tables.

-Upon reading the tweet, it occured to me that I didn't know how LR parsers
-worked and how they were generated, except in the broadest of terms. Thus, I
-set about writing this, learning as I went.
+The primary inspiration for this library is tree-sitter, which also generates
+LR parsers for grammars written in a Turing-complete language. We do the
+same thing, only our grammars are written in Python instead of JavaScript.

-This code is not written to be fast, or even efficient, although it runs its
-test cases fast enough. It was instead written to be easy to follow along
-with, so that when I forget how all this works I can come back to the code
-and read along and learn all over again.
+Why Python? Because Python 3 is widely pre-installed on macOS and Linux. This
+library requires nothing more than the basic standard library, and not even a
+new version of it. Therefore, it turns out to be a pretty light dependency for
+a Rust or C++ or similar kind of project. (Tree-sitter, on the other hand,
+requires Node.js, which is a far less stable and available runtime in 2024.)
+
+The parser tables can really be used to power anything. I prefer to make
+concrete syntax trees (again, see tree-sitter), and there is no facility at all
+for actions or custom ASTs or whatnot. Any such processing needs to be done by
+the thing that processes the tables.
+
+## Making Grammars
+
+To get started, create a grammar that derives from the `Grammar` class. Create
+one method per nonterminal, decorated with the `rule` decorator. Here's an
+example:
+
+    PLUS = Token('+')
+    LPAREN = Token('(')
+    RPAREN = Token(')')
+    ID = Token('id')
+
+    class SimpleGrammar(Grammar):
+        @rule
+        def expression(self):
+            return seq(self.expression, PLUS, self.term) | self.term
+
+        @rule
+        def term(self):
+            return seq(LPAREN, self.expression, RPAREN) | ID
+
+
+## Using Grammars
+
+TODO
+
+## Representation Choices
+
+The SimpleGrammar class might seem a little verbose compared to a dense
+structure like:
+
+    grammar_simple = [
+        ('E', ['E', '+', 'T']),
+        ('E', ['T']),
+        ('T', ['(', 'E', ')']),
+        ('T', ['id']),
+    ]
+
+or
+
+    grammar_simple = {
+      'E': [
+          ['E', '+', 'T'],
+          ['T'],
+      ],
+      'T': [
+          ['(', 'E', ')'],
+          ['id'],
+      ],
+    }
+
+
+The advantage that the class has over a table like this is that you get to have
+all of your Python tools help you make sure your grammar is good, if you want
+them. For example, if your editor talks to a language server, the members give
+you autocomplete, jump-to-definition, and possibly even type-checking.
+
+At the very least, if you mis-type the name of a nonterminal, or forget to
+implement it, we will immediately raise an error that *INCLUDES THE LOCATION IN
+THE SOURCE WHERE THE ERROR WAS MADE.* With tables, we can tell you that you
+made a mistake but it's up to you to figure out where you did it.
+
+### Aside: What about a custom DSL/EBNF-like thing?
+
+Yeah, OK, there's a rich history of writing your grammar in a domain-specific
+language. YACC did it, ANTLR does it, GRMTools... just about everybody except
+tree-sitter does this.
+
+But look, I've got several reasons for not doing it.
+
+First, I'm lazy, and don't want to write yet another parser for my parser. What
+tools should I use to write my parser generator parser? I guess I don't have my
+parser generator parser yet, so probably a hand-written top-down parser? Some
+other Python parser generator? Ugh!
+
+As an add-on to that, if I make my own format then I need to make tooling for
+*that* too: syntax highlighters, jump-to-definition, the works. Yuck. An
+existing language, and a format that builds on an existing language, gets me the
+tooling that comes along with that language. If you can leverage that
+effectively (and I think I have) then you start way ahead in terms of tooling.
+
+Second, this whole thing is supposed to be easy to include in an existing
+project, and adding a custom compiler doesn't seem to be that. Adding two Python
+files seems to be about the right speed.
+
+Third, and this is just hypothetical, it's probably pretty easy to write your
+own tooling around a grammar if it's already in Python. If you want to make
+railroad diagrams or EBNF pictures or whatever, all the productions are already
+right there in data structures for you to process. I've tried to keep them
+accessible and at least somewhat easy to work with. There's nothing that says a
+DSL-based system *has* to produce unusable intermediate data (certainly there
+are some tools that *try*), but with this approach the accessibility and the
+ergonomics of the tool go hand in hand.
+
+## Some History
+
+The first version of this code was written as an idle exercise to learn how LR
+parser table generation even worked. It was... very simple, fairly easy to
+follow, and just *incredibly* slow. Like, mind-bogglingly slow. Unusably slow
+for anything but the most trivial grammar.
+
+As a result, when I decided I wanted to use it for a larger grammar, I found that
+I just couldn't. So this has been hacked on and significantly improved since that
+version, and is now capable of building tables for nontrivial grammars. It could
+still be a lot faster, but it meets my needs for now.

 (BTW, the notes I read to learn how all this works are at
 http://dragonbook.stanford.edu/lecture-notes/Stanford-CS143/. Specifically,
@@ -20,7 +128,5 @@ I started with handout 8, 'Bottom-up-parsing', and went from there. (I did
 eventually have to backtrack a little into handout 7, since that's where
 First() and Follow() are covered.)

-Enjoy!
-
 doty
-2016-12-09
+May 2024