Hi Marijn. I know the official word is that one should not try to derive proper error messages from Lezer error nodes (you should write your own parser or use a different parser generator if you need error messages). Lezer’s error recovery is a black box with no knobs, and the rule is that a node with no error children will be well-formed, but for one with error children, all bets are off.
However, I have been using Lezer as the first-stage parser for a new compile-to-JavaScript programming language I am creating, which will have an IDE based on CodeMirror as the primary/intended way to write code in this language. I know how to write parsers—I’ve written recursive descent parsers, a parser combinator library, and one time a complete JavaScript parser—but Lezer actually provides a lot, including error recovery, and it isn’t obvious that writing a parser from scratch or using a different parser generator (and still writing a Lezer grammar for CodeMirror anyway!) is less work for me than mapping trees with errors in them to error messages. In general, it seems that Lezer’s error nodes do point out where the errors are, pretty much as well as any system could, and I have made great progress on reporting syntax errors and formatting/pretty-printing code, even in the presence of other syntax errors, which is something I’d like the IDE to be able to do.
Error recovery being a black box and unpredictable (and not precisely understood by me) does complicate things, especially when it comes to formatting code with errors in it, because I want formatting to be an idempotent operation (formatting twice should be the same as formatting once; someone who repeatedly presses the keyboard shortcut for “reformat” should not see the code change and evolve, or oscillate between two states). Even if I were just writing a formatting extension for CodeMirror, I’d want to confirm one thing: does changing the whitespace/skipped tokens ever change the results of error recovery (and therefore the shape of the tree) at all? In my language’s grammar, Newline is its own skipped token, so a series of skipped tokens might include a sequence like (spaces, Newline, spaces), in case it matters. Formatting could theoretically remove all this whitespace (or insert more), and then the resulting code would be reparsed, with a different set of skipped tokens, but hopefully with the same resulting tree shape (errors and all).
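To make the property concrete, here is a minimal sketch of the check I have in mind. It is not Lezer API: `ShapeNode`, `shapeOf`, and `checkFormatter` are hypothetical names, and the `parse` and `format` functions are assumed to be supplied by the caller (for example, a thin wrapper that converts a Lezer tree into this plain structure).

```typescript
// Hypothetical plain-data view of a syntax tree; not a Lezer type.
interface ShapeNode {
  name: string;
  children: ShapeNode[];
}

// Serialize node names and nesting only, ignoring positions and any
// node whose name is in `skipped` (e.g. a skipped Newline token).
function shapeOf(node: ShapeNode, skipped: Set<string>): string {
  const kids = node.children
    .filter((child) => !skipped.has(child.name))
    .map((child) => shapeOf(child, skipped));
  return kids.length > 0 ? `${node.name}(${kids.join(",")})` : node.name;
}

// Check two properties of a formatter on one input:
// 1. idempotence: formatting twice equals formatting once;
// 2. shape preservation: the reparsed tree has the same shape,
//    error nodes included, once skipped tokens are ignored.
function checkFormatter(
  source: string,
  format: (text: string) => string,
  parse: (text: string) => ShapeNode,
  skipped: Set<string>
): { idempotent: boolean; shapePreserved: boolean } {
  const once = format(source);
  const twice = format(once);
  return {
    idempotent: once === twice,
    shapePreserved:
      shapeOf(parse(source), skipped) === shapeOf(parse(once), skipped),
  };
}
```

If whitespace can influence error recovery, `shapePreserved` is the check that would fail on inputs containing errors, even for a formatter that only touches skipped tokens.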
The only other thing that makes Lezer’s trees-with-errors a little bizarre sometimes is when it inserts tokens to make something work in a really forced way, like both the open parenthesis and close parenthesis of a parenthesized expression! My small feature request would be to be able to restrict token insertion in error recovery (such as with a whitelist or blacklist saying what tokens can be inserted). Inserting (hallucinating) a close parenthesis is fantastic. An open parenthesis, not so much. Same with any keyword that introduces a certain statement or expression, like I don’t want the keyword “switch” to be inserted. I don’t want the open quote of a double-quoted string literal to be inserted.
It would be nice to be able to tell error recovery to do less, be less creative, and consider fewer options, if it results in it being more predictable, producing parses that are less weird, and if “all bets are off” anyway when there are errors, why not? Currently, for example, with my JavaScript-like grammar, Lezer treats `....` as just four dots someone weirdly left lying around, while `.....` is treated as five nested DotExpressions (like `foo.bar`). That’s fine, but it doesn’t seem like the kind of thing anyone is relying on; I don’t think it could get weirder or worse. Lezer doesn’t use its existing insertion powers to guess that `(.bar)` is a DotExpression missing the first expression (rather than an expression with a superfluous dot in the middle of it). It doesn’t typically guess where an open parenthesis is meant to go in a program where a random open parenthesis has been deleted, in my testing. So then why occasionally go to town inserting all sorts of things? I wonder what the least intelligent error recovery strategy is that still localizes syntax errors for the most part.
I don’t know how you’d assess the positive or negative impact of a change or a new configuration option here, but just wanted to see what you think.