Misplaced error nodes

Consider a parser based on a simple grammar like the following:

@top Program { Variable }
Variable { Dollar Name }
@tokens {
  Dollar { "$" }
  Name { $[a-z]+ }
}

When given $foo as input, the output tree is not particularly interesting:

Program(Variable(Dollar,Name))

However, when the dollar sign is missing (i.e. the input is just “foo”), the error recovery mechanism does not replace missing Dollar with an error node, it puts it outside of Variable and completely “forgets” about Dollar seemingly violating the grammar rules:

Program(⚠,Variable(Name))

This makes it quite tricky to manipulate and reason about the syntax tree. I assume this is a feature. Is there anywhere I can learn more about this behavior?

That node nesting was a bug, and should be fixed by the patch below. But unfortunately, no, there’s no guarantee about tree shape around error nodes—if a node has no error children, it should conform to the grammar (after this fix, at least), but if there are errors inside of it, those may represent any amount of omitted or inserted content.

1 Like

Thanks a lot Marijn for the quick fix! It helps tremendously. Yes, it makes sense there can be even horses eating each other inside malformed nodes, that is expected :slight_smile:

I think I might have found another case where this happens:

@top App { Test+ }
Test { Dollar Name Name }
@tokens {
  Dollar { "$" }
  Name { $[a-z] }
}

With such a grammar:

  • a is “correctly” parsed as App(⚠(Name))
  • aa is “correctly” parsed as App(Test(⚠,Name,Name))
  • aaa is incorrectly parsed as App(⚠(Name),Test(Name,Name))

I guess there is additional logic for cases when a token is unexpected rather that missing. I wonder what should be the most natural syntax tree in this case. I can think of several different options:

  1. App(⚠(Name),Test(⚠,Name,Name))
  2. App(Test(⚠,Name,Name),⚠(Name))
  3. App(Test(⚠(Name),Name,Name))
  4. App(Test(⚠(Name),⚠,Name,Name))
  5. App(Test(⚠,⚠(Name),Name,Name))
  6. App(Test(⚠,Name,Name,⚠(Name)))

I believe 3 is easiest to implement and still makes it fairly straightforward to reason about syntax errors, something in a way of: expected “$” but got “a” (assuming there is a parser on top of Lezer tree). But perhaps 1, 2, or 6 are more natural.

Edit: simplified grammar, added more options.

That was a somewhat different issue — the parser has a heuristic to merge adjacent error nodes, to avoid making too much of a mess in heavily error-corrected parses, and it wasn’t being careful about parent node boundaries there. This patch should help.

1 Like

Thanks, Marijn!