Is it OK to create a toy language using Lezer? Any examples?

I’m wondering whether it is OK to use Lezer to create a toy language. Has anyone created one? Is there an example?
I did some searching, and there are some other parser generators:

  • pegjs
  • ohm-js

Lezer is special because I’m planning to use CodeMirror as an online editor, so the grammar can be reused directly to highlight the code.

That works, though you should be aware that the trees Lezer produces are a bit more minimal and awkward than what a classical parser will give you (they are concrete syntax trees, not abstract ones). In this thread someone describes running another pass on such a tree to make it more practical and abstract.
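For a concrete sense of what such a pass has to work with, here is a minimal sketch (not code from this thread; the toy grammar in it is made up) that compiles a grammar with buildParser from @lezer/generator and walks the resulting concrete tree with a cursor, printing each node's name and source range:

import {buildParser} from "@lezer/generator"

// A made-up toy grammar: a sequence of `name = number` assignments.
const parser = buildParser(`
  @top Program { Assign+ }
  Assign { Name "=" Number }
  @skip { space }
  @tokens {
    Name { @asciiLetter+ }
    Number { @digit+ }
    space { @whitespace+ }
  }
`)

const src = "x = 1 y = 2"
const cursor = parser.parse(src).cursor()
do {
  // Every cursor position is a concrete node: just a name plus a source range.
  console.log(cursor.name, cursor.from, cursor.to, JSON.stringify(src.slice(cursor.from, cursor.to)))
} while (cursor.next())

The output is the raw concrete tree (roughly Program, Assign, Name, and Number nodes with positions). Anything more abstract, such as one object per assignment, is what the extra pass has to build.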

I spent too much time trying to make that approach work. In practice, for such purposes, Lezer only gives you validation (the fact that parsing succeeded) and the tokens. To actually turn the result into an AST, you end up re-implementing most of the parser. Even though you know there was a valid parse and you already have the correct tokens, to disambiguate which AST node and which field a token belongs to, you have to implement all the machinery of the LR parser again, including backtracking.

I got stuck writing my CST → AST converter when a few grammars seemed to be undecidable, or at least really hard to implement the necessary backtracking for, and I ended up giving up on the approach. For example, optional tokens would consume tokens that were required later, making my AST re-parser fail when it should have given the token to the required part.

Another painful one is literals that are not nodes: that information is lost, so I can only assume they parsed successfully, given that Lezer succeeded. But this can lead me to assume the wrong alternative in a choice like @top Top { "a" "x" | "b" "x" } @tokens { "x" }, which is undecidable given the output of Lezer: there is no difference between the parse results of "ax" and "bx".

I now use tree-sitter (compiled to WASM), which Lezer is based on, but which does allow annotating parts of the tree with field names, so building an AST becomes trivial.

This doesn’t sound right. If you structure your grammar well, the conversion to an abstract tree is a lot less work than parsing from scratch, and definitely does not involve any backtracking.

Here is the simplest example of where I do need backtracking:

@top Foo { "a"? "a" } @tokens { "a" }

I get the following Lezer tree with input "a":

Foo [1:0..1:1]
┗━  a [1:0..1:1]: "a"

But in an AST re-parser I would need to check both branches of "a"? (consuming the "a" token and not consuming it) to find a valid result.

I tried before to emulate fields by wrapping parts of the grammar in additional nodes where I wanted fields, but this caused otherwise valid Lezer grammars to become ambiguous, so I assumed I could not simply add nodes naively to structure my grammar as needed. A simple test:

@top Foo { A? B } A { "a" } B { "a" }

does work correctly, though, so it might be worth trying again to see whether it can be solved after all.
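A quick way to try that test, sketched with buildParser from @lezer/generator (not code from this thread, and the behaviour described in the comment is my reading of the grammar): since A and B are distinct named nodes, getChild can tell the optional part from the required one by name, with no re-parsing.

import {buildParser} from "@lezer/generator"

const parser = buildParser(`@top Foo { A? B } A { "a" } B { "a" }`)

// With input "aa" both nodes should be present; with "a" only B should be.
for (const input of ["a", "aa"]) {
  const foo = parser.parse(input).topNode
  console.log(input, "A:", foo.getChild("A")?.name ?? null, "B:", foo.getChild("B")?.name ?? null)
}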

The undecidable issue, not knowing whether the "a" "x" or "b" "x" branch was taken, could be resolved by simply requiring all tokens to be nodes.
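A rough sketch of what I mean (my assumption, not something verified in this thread, is that listing the literals in the @tokens block is enough to give them node types, as the earlier "a" example suggests):

import {buildParser} from "@lezer/generator"

// The same choice as before, but with "a" and "b" also declared in @tokens
// so that, assuming declared literals get node types, they show up in the tree.
const parser = buildParser(`
  @top Top { "a" "x" | "b" "x" }
  @tokens {
    "a"
    "b"
    "x"
  }
`)

// The two parses should now produce different trees: one containing an "a"
// node, the other a "b" node.
console.log(parser.parse("ax").toString())
console.log(parser.parse("bx").toString())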

I managed to dig up the experiment I did at the time, which made me think that just rewriting my grammar to include more nodes to disambiguate where necessary would not work:

reduce/reduce conflict between
  BinaryExpression_Left -> expression
and
  BinaryExpression_Right -> expression
With input:
  BinaryExpression_Left BinaryExpression_Operator expression · PlusToken …
Shared origin: expression -> · BinaryExpression
  via BinaryExpression -> · BinaryExpression_Left BinaryExpression_Operator BinaryExpression_Right
    BinaryExpression_Left -> expression ·
  via BinaryExpression -> BinaryExpression_Left BinaryExpression_Operator · BinaryExpression_Right
    BinaryExpression_Right -> expression ·

reduce/reduce conflict between
  BinaryExpression_Left-1 -> expression
and
  BinaryExpression_Right-1 -> expression
With input:
  BinaryExpression_Left-1 BinaryExpression_Operator-1 expression · MinusToken …
Shared origin: expression -> · BinaryExpression
  via BinaryExpression -> · BinaryExpression_Left-1 BinaryExpression_Operator-1 BinaryExpression_Right-1
    BinaryExpression_Left-1 -> expression ·
  via BinaryExpression -> BinaryExpression_Left-1 BinaryExpression_Operator-1 · BinaryExpression_Right-1
    BinaryExpression_Right-1 -> expression ·

With the following grammar:

@precedence { plus @left, minus @left }

@top SourceFile {
  ExpressionStatement { expression }
}

expression { ParenthesizedExpression | BinaryExpression | NumericLiteral }

ParenthesizedExpression { "(" expression ")" }

BinaryExpression {
  BinaryExpression_Left { expression } !plus BinaryExpression_Operator { PlusToken } BinaryExpression_Right { expression } |
  BinaryExpression_Left { expression } !minus BinaryExpression_Operator { MinusToken } BinaryExpression_Right { expression }
}

@skip { space }

@tokens {
  "("
  ")"
  space { @whitespace+ }
  NumericLiteral { @digit+ }
  PlusToken { "+" }
  MinusToken { "-" }
}

Though this is likely due to my lack of experience with LR grammars and their gotchas. There are probably ways to restructure it so that it compiles, but it at least indicated to me that a simple rewrite of a big existing grammar, like that of JS, was unlikely to work.

Right, there are situations where adding wrapping nodes will make the grammar non-LR(1). But getChild should, at least in the case of this example, be able to pick out the nodes you’re looking for without any difficult custom code.
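To make that concrete, here is a minimal sketch of my own (not from the thread): it uses a conflict-free variant of the grammar above with the Left/Right wrapper rules removed, and converts the tree to an AST with getChild, using its optional arguments (as I understand them) to pick out the operand before and after the operator node.

import {buildParser} from "@lezer/generator"
import type {SyntaxNode} from "@lezer/common"

const parser = buildParser(`
  @precedence { plus @left, minus @left }
  @top SourceFile { expression }
  expression { ParenthesizedExpression | BinaryExpression | NumericLiteral }
  ParenthesizedExpression { "(" expression ")" }
  BinaryExpression {
    expression !plus PlusToken expression |
    expression !minus MinusToken expression
  }
  @skip { space }
  @tokens {
    space { @whitespace+ }
    NumericLiteral { @digit+ }
    PlusToken { "+" }
    MinusToken { "-" }
  }
`)

// A hypothetical AST shape, just for illustration.
type Expr =
  | {type: "number", value: number}
  | {type: "binary", op: "+" | "-", left: Expr, right: Expr}

const exprTypes = ["ParenthesizedExpression", "BinaryExpression", "NumericLiteral"]

// First child that is an expression node, optionally limited to children that
// come after (`before`) or before (`after`) a node with the given name.
function exprChild(node: SyntaxNode, before: string | null = null, after: string | null = null): SyntaxNode {
  for (const type of exprTypes) {
    const child = node.getChild(type, before, after)
    if (child) return child
  }
  throw new Error(`no expression child in ${node.name}`)
}

function toExpr(node: SyntaxNode, src: string): Expr {
  switch (node.name) {
    case "NumericLiteral":
      return {type: "number", value: Number(src.slice(node.from, node.to))}
    case "ParenthesizedExpression":
      // Its inner expression is the only expression-typed child.
      return toExpr(exprChild(node), src)
    case "BinaryExpression": {
      // The operator is a named node, so getChild can find it directly.
      const opName = node.getChild("PlusToken") ? "PlusToken" : "MinusToken"
      return {
        type: "binary",
        op: opName === "PlusToken" ? "+" : "-",
        left: toExpr(exprChild(node, null, opName), src),  // operand before the operator
        right: toExpr(exprChild(node, opName), src),       // operand after the operator
      }
    }
    default:
      throw new Error(`unexpected node ${node.name}`)
  }
}

const src = "1 + 2 - 3"
const tree = parser.parse(src)
console.log(JSON.stringify(toExpr(exprChild(tree.topNode), src)))
// Left-associative, so this should print the AST for (1 + 2) - 3.

Under those assumptions the conversion stays a plain recursive walk over the tree, with no backtracking involved.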