# A library for grammars

This is a library to do interesting things with grammars. This was originally built as a little toy for me to understand how LR parser tables worked, but I discovered that what I *really* want is to be able to leverage the grammar to do other things besides parsing.

The primary inspiration for this library is tree-sitter, which also generates LR parsers for grammars written in a Turing-complete language. Like tree-sitter, we write grammars in a programming language, only we do it in Python instead of JavaScript.

## Making Grammars

To get started, create a grammar that derives from the `Grammar` class. Create one method per non-terminal, decorated with the `rule` decorator. Here's an example:

```python
from parser import *

@rule
def expression():
    return seq(expression, PLUS, term) | term

@rule
def term():
    return seq(LPAREN, expression, RPAREN) | ID

PLUS = Terminal('PLUS', '+')
LPAREN = Terminal('LPAREN', '(')
RPAREN = Terminal('RPAREN', ')')
ID = Terminal(
    'ID',
    Re.seq(
        Re.set(("a", "z"), ("A", "Z"), "_"),
        Re.set(("a", "z"), ("A", "Z"), ("0", "9"), "_").star(),
    ),
)

SimpleGrammar = Grammar(
    name="Simple",
    start=expression,
)
```

Terminals can be plain strings or regular expressions constructed with the `Re` object. (Ironically, I guess this library is not clever enough to parse a regular expression string into one of these structures. If you want to build one, go nuts! It's just Python, you can do whatever you want so long as the result is an `Re` object.)

Productions can be built out of terminals and non-terminals, concatenated with the `seq` function or the `+` operator. Alternatives can be expressed with the `alt` function or the `|` operator. These things can be freely nested, as desired.

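
As a rough illustration of how the `+` and `|` operators can stand in for `seq` and `alt`, here is a small self-contained sketch of operator overloading in Python. The `Node`, `Tok`, `Seq`, and `Alt` classes (and the toy `seq` and `alt` functions defined next to them) are hypothetical stand-ins written for this example; they are not this library's classes, and the real implementation may well differ.

```python
from dataclasses import dataclass


class Node:
    """Hypothetical base class: anything that can appear in a production."""

    def __add__(self, other: "Node") -> "Seq":
        # a + b reads as "a followed by b"
        return seq(self, other)

    def __or__(self, other: "Node") -> "Alt":
        # a | b reads as "either a or b"
        return alt(self, other)


@dataclass(frozen=True)
class Tok(Node):
    name: str


@dataclass(frozen=True)
class Seq(Node):
    items: tuple


@dataclass(frozen=True)
class Alt(Node):
    choices: tuple


def seq(*parts: Node) -> Seq:
    # Flatten nested sequences so (a + b) + c equals seq(a, b, c).
    flat = []
    for p in parts:
        flat.extend(p.items if isinstance(p, Seq) else (p,))
    return Seq(tuple(flat))


def alt(*parts: Node) -> Alt:
    # Flatten nested alternatives the same way.
    flat = []
    for p in parts:
        flat.extend(p.choices if isinstance(p, Alt) else (p,))
    return Alt(tuple(flat))


A, B, C = Tok("A"), Tok("B"), Tok("C")

# The operator form and the function form build the same structure.
assert A + B + C == seq(A, B, C)
assert (A + B) | C == alt(seq(A, B), C)
```

Because `+` binds more tightly than `|` in Python, `A + B | C` groups the same way you would expect from EBNF: a sequence on the left, offered as one alternative.
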
You can make lists in the classic context-free grammar way:

```python
@rule
def list():
    return NUMBER | (list + COMMA + NUMBER)

NUMBER = Terminal(Re.set(("0", "9")).plus())
COMMA = Terminal(',')

NumberList = Grammar(
    name="NumberList",
    start=list,
)
```

(Unlike with PEGs, you can write grammars with left- or right-recursion without restriction; either is fine.)

When used to generate a parser, the grammar describes a concrete syntax tree. Unfortunately, that means that the list example above will generate a very awkward tree for `1,2,3`:

```
list
  list
    list
      NUMBER ("1")
    COMMA
    NUMBER ("2")
  COMMA
  NUMBER ("3")
```

In order to make this a little cleaner, rules can be "transparent", which means they don't generate nodes in the tree and just dump their contents into the parent node instead.

```python
@rule
def list():
    # The starting rule can't be transparent: there has to be something to
    # hold on to!
    return transparent_list

@rule(transparent=True)
def transparent_list() -> Rule:
    return NUMBER | (transparent_list + COMMA + NUMBER)

NUMBER = Terminal(Re.set(("0", "9")).plus())
COMMA = Terminal(',')

NumberList = Grammar(
    name="NumberList",
    start=list,
)
```

This grammar will generate the far more useful tree:

```
list
  NUMBER ("1")
  COMMA
  NUMBER ("2")
  COMMA
  NUMBER ("3")
```

Rules that start with `_` are also interpreted as transparent, following the lead set by tree-sitter, and so the grammar above is probably better-written as:

```python
@rule
def list():
    # The starting rule can't be transparent: there has to be something to
    # hold on to!
    return _list

@rule
def _list() -> Rule:
    return NUMBER | (_list + COMMA + NUMBER)

NUMBER = Terminal(Re.set(("0", "9")).plus())
COMMA = Terminal(',')

NumberList = Grammar(
    name="NumberList",
    start=list,
)
```

That will generate the same tree, but a little more succinctly.

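
One way to picture what a transparent rule does is as a flattening step over the tree: any node produced by a transparent rule is replaced by its children. The sketch below is a self-contained illustration using a hypothetical tuple-based tree (`(name, children)` for nodes, plain strings for tokens) and a made-up function name, `splice_transparent`. It is not how this library represents trees, and it says nothing about how transparency is implemented internally.

```python
def splice_transparent(node):
    """Return `node` with every transparent child replaced by its children.

    Hypothetical tree shape for this sketch: a token is a plain string, a
    node is a (name, children) tuple, and a rule counts as transparent
    when its name starts with "_".
    """
    if isinstance(node, str):
        return node
    name, children = node
    out = []
    for child in children:
        child = splice_transparent(child)
        if not isinstance(child, str) and child[0].startswith("_"):
            out.extend(child[1])  # dump the contents into the parent
        else:
            out.append(child)
    return (name, out)


# The awkward left-recursive tree for `1,2,3`, with the recursive rule
# named `_list` so that it counts as transparent.
tree = ("list", [
    ("_list", [
        ("_list", [
            ("_list", ["1"]),
            ",", "2",
        ]),
        ",", "3",
    ]),
])

flat = splice_transparent(tree)
assert flat == ("list", ["1", ",", "2", ",", "3"])
```

The root node is never spliced away, which mirrors the point made in the comments above: the starting rule has to leave something to hold on to.
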
Of course, it's a lot of work to write these transparent recursive rules by hand all the time, so there are helpers that do it for you:

```python
@rule
def list():
    return zero_or_more(NUMBER, COMMA) + NUMBER

NUMBER = Terminal(Re.set(("0", "9")).plus())
COMMA = Terminal(',')

NumberList = Grammar(
    name="NumberList",
    start=list,
)
```

Much better.

### Trivia

Most folks that want to parse something want to skip blanks when they do it. Our grammars don't say anything about that by default (sorry), so you probably want to be explicit about such things.

To allow (and ignore) spaces, newlines, tabs, and carriage-returns in our number lists, we would modify the grammar as follows:

```python
@rule
def list():
    return zero_or_more(NUMBER, COMMA) + NUMBER

NUMBER = Terminal(Re.set(("0", "9")).plus())
COMMA = Terminal(',')

BLANKS = Terminal(Re.set(" ", "\t", "\r", "\n").plus())

NumberList = Grammar(
    name="NumberList",
    start=list,
    trivia=[BLANKS],
)
```

Now we can parse a list with spaces! `1 , 2, 3` will parse happily into:

```
list
  NUMBER ("1")
  COMMA
  NUMBER ("2")
  COMMA
  NUMBER ("3")
```

## Using Grammars

### Making Parsers and Parsing Text

Once you have a grammar you can make a parse table from it by constructing an instance of the grammar and calling the `build_table` method on it. The `compile_lexer` method builds the lexer table that goes with it.

```python
grammar = NumberList()
parse_table = grammar.build_table()
lexer_table = grammar.compile_lexer()
```

In theory, in the future, you could pass the table to an output generator and it would build a C source file or a Rust source file or something to run the parse. Right now the only runtime is also written in Python, so you can do a parse as follows:

```python
from parser import runtime

text = "1,2,3"
result, errors = runtime.parse(parse_table, lexer_table, text)
```

`result` in the above example will be a concrete syntax tree, if the parse was successful, and `errors` will be a list of error strings from the parse.

Note that the Python runtime has automatic error recovery (with a variant of [CPCT+](https://tratt.net/laurie/blog/2020/automatic_syntax_error_recovery.html)), so you may get a parse tree even if there were parse errors.

## Questions

### Why Python?

There are a few reasons to use Python here.

First, Python 3 is widely pre-installed on macOS and Linux. This library requires nothing more than the basic standard library, and not even a new version of it. Therefore, it turns out to be a pretty light dependency for a Rust or C++ or some other kind of project, where you're using this to generate the parser tables but the parser itself will be in some other language.

(Tree-sitter, on the other hand, requires its own standalone binary in addition to node, which is a far less stable and available runtime in 2024.)

I also find the ergonomics of working in Python a little nicer than working in, say, JavaScript. Python gives me operator overloading for things like `|` and `+`, which make the rules read a little closer to EBNF for me. It gives me type annotations that work without running a compiler over my input. It also *actually raises errors* when I accidentally misspell the name of a rule. And those errors come with the source location of exactly where I made the spelling mistake!

Finally, I guess you could ask why I'm not using some DSL or something like literally every other parser generator tool except for tree-sitter. And the answer for that is: I just don't care to maintain a parser for my parser generator. ("Yo dawg, I heard you liked parsers...") Python gives me the ability to describe the data I want, in an easy to leverage way, that comes with all the power and flexibility of a general-purpose programming language. Turns out to be pretty nice.

### What about grammars where blank space is significant, like ... well, python?

Right now there's no way to describe them natively.
You could write the grammar and introduce terminals like `INDENT` and `DEDENT`, but you would have to write a custom lexer to produce those terminals, and probably handle them differently in all the other uses of the grammar as well. That limits the ability to write the grammar once and automatically use it everywhere, but maybe it's good enough for you?
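
For a sense of what such a custom lexer involves, here is a minimal, self-contained sketch of the usual indentation-stack technique for producing `INDENT` and `DEDENT` tokens. The function name `indent_tokens` and the `(kind, text)` token shape are made up for this example, and nothing here is wired into this library's lexer tables or runtime. It also assumes indentation is written with spaces only.

```python
def indent_tokens(source: str):
    """Yield (kind, text) pairs, with INDENT/DEDENT around indented blocks.

    Hypothetical helper for illustration; each non-blank line is emitted
    whole as a LINE token rather than being tokenized further.
    """
    stack = [0]  # indentation widths of the currently open blocks
    for raw in source.splitlines():
        if not raw.strip():
            continue  # blank lines never change the block structure
        width = len(raw) - len(raw.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", "")
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT", "")
        if width != stack[-1]:
            raise ValueError(f"inconsistent dedent: {raw!r}")
        yield ("LINE", raw.strip())
    # Close any blocks still open at end of input.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", "")


example = "if x:\n    y\n    if z:\n        w\nq\n"
kinds = [kind for kind, _ in indent_tokens(example)]
assert kinds == [
    "LINE", "INDENT", "LINE", "LINE", "INDENT", "LINE",
    "DEDENT", "DEDENT", "LINE",
]
```

A real version would tokenize the contents of each line as well, and the grammar would then refer to `INDENT` and `DEDENT` like any other terminal.
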