Misplaced error nodes

mikhail · April 21, 2022, 5:56am

Consider a parser based on a simple grammar like the following:

@top Program { Variable }
Variable { Dollar Name }
@tokens {
  Dollar { "$" }
  Name { $[a-z]+ }
}

When given $foo as input, the output tree is not particularly interesting:

Program(Variable(Dollar,Name))

However, when the dollar sign is missing (i.e. the input is just “foo”), the error recovery mechanism does not replace missing Dollar with an error node, it puts it outside of Variable and completely “forgets” about Dollar seemingly violating the grammar rules:

Program(⚠,Variable(Name))

This makes it quite tricky to manipulate and reason about the syntax tree. I assume this is a feature. Is there anywhere I can learn more about this behavior?

marijn · April 21, 2022, 7:57am

That node nesting was a bug, and should be fixed by the patch below. But unfortunately, no, there’s no guarantee about tree shape around error nodes—if a node has no error children, it should conform to the grammar (after this fix, at least), but if there are errors inside of it, those may represent any amount of omitted or inserted content.

mikhail · April 21, 2022, 8:12am

Thanks a lot Marijn for the quick fix! It helps tremendously. Yes, it makes sense there can be even horses eating each other inside malformed nodes, that is expected

mikhail · April 21, 2022, 9:07am

I think I might have found another case where this happens:

@top App { Test+ }
Test { Dollar Name Name }
@tokens {
  Dollar { "$" }
  Name { $[a-z] }
}

With such a grammar:

a is “correctly” parsed as App(⚠(Name))
aa is “correctly” parsed as App(Test(⚠,Name,Name))
aaa is incorrectly parsed as App(⚠(Name),Test(Name,Name))

I guess there is additional logic for cases when a token is unexpected rather that missing. I wonder what should be the most natural syntax tree in this case. I can think of several different options:

App(⚠(Name),Test(⚠,Name,Name))
App(Test(⚠,Name,Name),⚠(Name))
App(Test(⚠(Name),Name,Name))
App(Test(⚠(Name),⚠,Name,Name))
App(Test(⚠,⚠(Name),Name,Name))
App(Test(⚠,Name,Name,⚠(Name)))

I believe 3 is easiest to implement and still makes it fairly straightforward to reason about syntax errors, something in a way of: expected “$” but got “a” (assuming there is a parser on top of Lezer tree). But perhaps 1, 2, or 6 are more natural.

Edit: simplified grammar, added more options.

marijn · April 21, 2022, 3:02pm

That was a somewhat different issue — the parser has a heuristic to merge adjacent error nodes, to avoid making too much of a mess in heavily error-corrected parses, and it wasn’t being careful about parent node boundaries there. This patch should help.

mikhail · April 22, 2022, 1:36am

Thanks, Marijn!