Lezer to AST

grayen · June 28, 2024, 12:37pm

I made a few language extensions to TypeScript using Babel, but Babel turned out not to be a great fit for this, as it is solely focused on the JavaScript ecosystem, and not meant for custom languages. It also does not have any parse error recovery, it can only recover errors based on analysing the succesfully parsed tree, like having a redefined const, making using it as part of a language service (editor support) a bad fit. As this would mean e.g. not having auto complete at object.|, where | is the cursor, as Babel fails to parse this, and thus there is no parse tree to determine the auto completion on. In my use case, there is no parse tree to transform to valid TypeScript that can be further processed by the TypeScript language service.

As far as I know Lezer is the only parser with parse error recovery as a JavaScript library, without resorting to WASM, that is actively maintained. And as it is designed for use in a code editor, it is a great fit for language support. I know Lezer stops at parse trees, as that is all that is needed for its intended use case. However, since all alternatives I could find that do provide support for abstract syntax trees per design are either unmaintained or missing some of those crucial features for good language service support, I plan to use Lezer as my basis.

I can construct a basic abstract tree using the the original input and iterating over the parse tree. However for the abstract tree to be really useful, it would need named children, like being able to refer to the left and right expression in a binary expression. That example can be easily dealt with using some additional metadata that a binary expression should have 2 expression children and 1 operator child, and that the first expression is left, and second right, but I can think of more complex examples where it would be undecidable without knowing the alternatives chosen. For showcase purposes, say we add the syntax to name children with <name>:, and we have something like this:

BinaryExpression { left:Expression Operator right:Expression | "flip" right:Expression Operator left:Expression }

Where the "flip" alternative would allow you to write a binary expression with the expressions flipped. Given either alternative, my Lezer parse tree would only know it is a BinaryExpression and that is has the children [Expression, Operator, Expression], not giving me enough information to determine which Expression should be named left or right.

Is this information present in the Lezer parse tree? The most obvious answer would be to inspect the tokens and determine the alternatives that way, like checking for the "flip" token, but that would require me to basically reparse the parse tree again, which feels wrong, redoing part of the work Lezer already did.

Another option would be to make sure that whenever the parse tree becomes ambiguous like this, to rewrite the grammer to add additional rules to make these distinctions.

Is there some obvious solution I am missing?

Or is using Lezer for things like this simply not a good fit? If so, do you know any alternatives?

marijn · June 28, 2024, 1:51pm

No. The only way to get access to it is to read the range of the input document covered by the tokens.

Possibly the acorn-loose package is of interest to you.

AlexErrant · June 28, 2024, 2:37pm

Just wondering; what WASM libraries exist? I know Tree-sitter can, but googling “wasm parser generator” doesn’t yield much, and AFAIK crates.io doesn’t let you filter by “compiles-to-wasm”. (Adding “wasm” to the search also yields questionable results.)

grayen · July 1, 2024, 7:06am

I wasn’t really thinking of specific ones, just that the possibility exists to compile languages such as C++ and Rust to WASM. Maybe not the parser generator itself, but the generated outputs should be easy to compile to WASM (unlikely to have dependencies that complicate things). For potential parser generators there is the list on Wikipedia for example.

grayen · July 1, 2024, 7:45am

I expected as much, but good to have it confirmed. I did look at alternatives like Lark.js and the JavaScript runtime for ANTLR. However Lark is Python first, the JS version is second class (compiled from the Python version) and you have to implement an error recovery strategy yourself (from what I understand). And ANTLR does not have incremental parsing and unlike Lezer, it’s language grammars do not have active users like Lezer has with CodeMirror. It’s TypeScript grammar for example is extremely outdated.

I could make ANTLR work, but I think it would be less effort to have Lezer support my AST needs than to add incremental parsing to ANTLR. And the JavaScript runtime of ANTLR is unlikely to be used much, while Lezer is actively used, so I feel more confident that it will remain maintained.

Thanks for the acorn-loose suggestion, but I am afraid it is like Babel in that it is too much focused on JavaScript, and some of my language extensions are unlike anything in JavaScript. And if I have to hack the invalid parse tree in something workable, I might just as well put that time getting Lezer to produce an AST, which feels like the more stable solution.

I currently plan to look at all child tokens and do sort of smart pattern match to determine the alternative (looking at min/max length of alternatives compared to the list of children and checking for the presence of literal tokens). I can get the structure of alternatives by using your Lezer grammar parser.