Way to customize error recovery just a bit?

dgreensp · December 12, 2023, 10:33pm

Hi Marijn. I know the official word is, one should not be trying to derive proper error messages from Lezer error nodes (you should write your own parser or use a different parser generator if you need error messages); Lezer’s error recovery is a black box with no knobs, and the rule is, a node with no error children will be well-formed, but for one with error children, all bets are off.

However, I have been using Lezer as the first-stage parser for a new compile-to-JavaScript programming language I am creating, which will have an IDE based on CodeMirror as the primary/intended way to write code in this language. I know how to write parsers—I’ve written recursive descent parsers, a parser combinator library, and one time a complete JavaScript parser—but Lezer actually provides a lot, including error recovery, and it isn’t obvious that writing a parser from scratch or using a different parser generator (and still writing a Lezer grammar for CodeMirror anyway!) is less work for me than mapping trees with errors in them to error messages. In general, it seems that Lezer’s error nodes do point out where the errors are, pretty much as well as any system could, and I have made great progress on reporting syntax errors and formatting/pretty-printing code, even in the presence of other syntax errors, which is something I’d like the IDE to be able to do.

Error recovery being a black box and unpredictable (and not precisely understood by me) does complicate things, especially when it comes to formatting code with errors in it, and wanting formatting to be an idempotent operation (so that formatting twice is the same as formatting once; someone who repeatedly presses the keyboard shortcut for “reformat” should not see the code change and evolve, or oscillate between two states). Even if I were just writing a formatting extension for CodeMirror, I guess I’d just want to confirm, does changing the whitespace/skipped tokens ever change the results of error recovery (and therefore the shape of the tree) at all? In my language’s grammar, Newline is its own skipped token, so a series of skipped tokens might include a sequence like (spaces, Newline, spaces), in case it matters. Formatting could theoretically remove all this whitespace (or insert more), and then the resulting code would be reparsed, with a different set of skipped tokens, but hopefully with the same resulting tree shape (errors and all).

The only other thing that makes Lezer’s trees-with-errors a little bizarre sometimes is when it inserts tokens to make something work in a really forced way, like both the open parenthesis and close parenthesis of a parenthesized expression! My small feature request would be to be able to restrict token insertion in error recovery (such as with a whitelist or blacklist saying what tokens can be inserted). Inserting (hallucinating) a close parenthesis is fantastic. An open parenthesis, not so much. Same with any keyword that introduces a certain statement or expression, like I don’t want the keyword “switch” to be inserted. I don’t want the open quote of a double-quoted string literal to be inserted.

It would be nice to be able to tell error recovery to do less, be less creative, and consider fewer options, if it results in it being more predictable, producing parses that are less weird, and if “all bets are off” anyway when there are errors, why not? Currently, for example, with my JavaScript-like grammar, Lezer treats .... as just four dots someone weirdly left lying around, while ..... is treated as five nested DotExpressions (like foo.bar). That’s fine, but it doesn’t seem like the kind of thing anyone is relying on; I don’t think it could get weirder or worse. Lezer doesn’t use its existing insertion powers to guess that (.bar) is a DotExpression missing the first expression (rather than an expression with a superfluous dot in the middle of it). It doesn’t typically guess where an open parenthesis is meant to go in a program where a random open parenthesis has been deleted, in my testing. So then why occasionally go to town inserting all sorts of things? I wonder what the least intelligent error recovery strategy is that still localizes syntax errors for the most part.

I don’t know how you’d assess the positive or negative impact of a change or a new configuration option here, but just wanted to see what you think.

marijn · December 13, 2023, 8:57am

Token count has an effect on error recovery behavior, so yes, even for a language where whitespace isn’t significant, in some corner cases, adding/removing whitespace tokens could cause error recovery to pick a different parse.

Error recovery works through a pretty dumb search—it tries parse actions in parallel for a bit, assigning a penalty score to each, and then picks the path that seems to lead to the least terrible continued parse. That makes it automatic and very creative, but also very unpredictable and messy.

I’ve been intentionally staying away from requiring grammar authors to add error-recovery-related information to their grammars, because that’d be yet another thing that people writing grammars would need to understand, and another bit of complexity in the grammar definitions.

That being said, it would probably be possible to do something like introducing an annotation on tokens that forbid that token from ever being inserted during error recovery. I’m not sure this will get you much closer to what you are trying to do here though—you also cite some examples of behavior that surprises you that would not be preventable this way.

dgreensp · December 13, 2023, 6:59pm

Thanks, that’s helpful to know! After sleeping on it, I’m going to see about writing my own, tailor-made parser for this. Lezer has been really helpful for getting my language going, and I can still integrate with CodeMirror.

I imagine if I pressed ahead with Lezer I’d be saying similar to things to this guy who tried to use tree-sitter for a similar purpose (basically that the results of error recovery aren’t what a top-down parser would see or what a human would see; they aren’t exactly “good” and you don’t have control over them): Is tree-sitter good enough? – Jake Zimmerman

marijn · December 13, 2023, 8:13pm

You can turn off error recovery entirely with the strict option, if you’re doing something that’d normally be done with a traditional, non-error-correcting parser.

dgreensp · December 13, 2023, 9:18pm

That’s true. I do want error recovery, ideally really good recovery, with good error messages. I just need total control over it, for a variety of reasons, like syntax errors being sufficiently helpful and stable, formatting that satisfies certain invariants, and autocomplete suggestions as you type. The recovery approach of trying skipping or inserting a token at random and continuing parsing doesn’t really resonate with me, and I think I want to explore different possibilities. A way to replace normal error recovery completely in Lezer, or change its behavior to something much simpler, could help, but like you say, it probably won’t get me all the way to where I want.