Hi Marijn. I know the official word is that one should not try to derive proper error messages from Lezer error nodes (if you need error messages, write your own parser or use a different parser generator). Lezer’s error recovery is a black box with no knobs, and the rule is that a node with no error children will be well-formed, but for one with error children, all bets are off.
Error recovery being an unpredictable black box (and one I don’t precisely understand) does complicate things, especially when formatting code that has errors in it and wanting formatting to be idempotent, so that formatting twice is the same as formatting once; someone who repeatedly presses the keyboard shortcut for “reformat” should not see the code change and evolve, or oscillate between two states. Even if I were just writing a formatting extension for CodeMirror, I’d want to confirm: does changing the whitespace/skipped tokens ever change the results of error recovery (and therefore the shape of the tree) at all? In my language’s grammar, Newline is its own skipped token, so a run of skipped tokens might include a sequence like (spaces, Newline, spaces), in case it matters. Formatting could theoretically remove all this whitespace (or insert more), and the resulting code would then be reparsed with a different set of skipped tokens, but hopefully with the same resulting tree shape (errors and all).
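To make the property concrete, here is a rough sketch of the check I have in mind. It is only an illustration: `ShapeNode`, `shapeOf`, and `checkFormat` are names I made up, and the node type is a stand-in for whatever you get from walking a real Lezer tree, not a Lezer API.

```typescript
// Hypothetical minimal node type (an assumption for this sketch, not Lezer's API).
interface ShapeNode {
  name: string;
  isError: boolean;
  children: ShapeNode[];
}

// Serialize only node names and nesting, ignoring positions and skipped tokens.
function shapeOf(node: ShapeNode): string {
  const label = node.isError ? "ERROR" : node.name;
  if (node.children.length === 0) return label;
  return label + "(" + node.children.map(shapeOf).join(",") + ")";
}

// Two properties for a given input:
//   sameShape:  reparsing the formatted text gives the same tree shape
//   idempotent: formatting the formatted text changes nothing
function checkFormat(
  src: string,
  parse: (s: string) => ShapeNode,
  format: (s: string) => string
): { sameShape: boolean; idempotent: boolean } {
  const once = format(src);
  const twice = format(once);
  return {
    sameShape: shapeOf(parse(src)) === shapeOf(parse(once)),
    idempotent: once === twice,
  };
}
```

If whitespace changes can alter error recovery, `sameShape` could come back false for some inputs even when the formatter itself is well behaved, which is exactly the case I’m asking about.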
The only other thing that makes Lezer’s trees-with-errors a little bizarre sometimes is when it inserts tokens to make something work in a really forced way, like both the open parenthesis and close parenthesis of a parenthesized expression! My small feature request would be to be able to restrict token insertion in error recovery (such as with a whitelist or blacklist saying what tokens can be inserted). Inserting (hallucinating) a close parenthesis is fantastic. An open parenthesis, not so much. Same with any keyword that introduces a certain statement or expression, like I don’t want the keyword “switch” to be inserted. I don’t want the open quote of a double-quoted string literal to be inserted.
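Roughly what I imagine, purely as a sketch: nothing like this exists in Lezer today, and `InsertionPolicy` and `filterInsertions` are hypothetical names. The idea is just that recovery would drop any insertion candidate whose token is on a deny list.

```typescript
// Hypothetical configuration (an assumption for this sketch, not a Lezer option).
interface InsertionPolicy {
  neverInsert: Set<string>; // token names that recovery may not hallucinate
}

// Keep only the candidate tokens that the policy allows to be inserted.
function filterInsertions(candidates: string[], policy: InsertionPolicy): string[] {
  return candidates.filter((name) => !policy.neverInsert.has(name));
}

// Openers and introducing keywords are denied; closers stay insertable.
const examplePolicy: InsertionPolicy = {
  neverInsert: new Set(["(", "switch", '"']),
};
```

With this policy, a candidate list of `(`, `)`, and `switch` would be reduced to just `)`.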
.... is treated as just four dots someone weirdly left lying around, while ..... is treated as five nested DotExpressions (like foo.bar). That’s fine, but it doesn’t seem like the kind of thing anyone is relying on; I don’t think it could get weirder or worse. Lezer doesn’t use its existing insertion powers to guess that (.bar) is a DotExpression missing the first expression (rather than an expression with a superfluous dot in the middle of it). In my testing, it doesn’t typically guess where an open parenthesis is meant to go in a program where a random open parenthesis has been deleted. So then why occasionally go to town inserting all sorts of things? I wonder what the least intelligent error recovery strategy is that still localizes syntax errors for the most part.
I don’t know how you’d assess the positive or negative impact of a change or a new configuration option here, but just wanted to see what you think.