This took way longer than I hoped, but I’ve just tagged version 0.15.0 of the Lezer packages that I maintain. This is an incompatible upgrade, though unless you maintain a grammar with non-trivial external tokenizers or use nested parsing, it should be very easy to upgrade.
You can see the full change log on the website.
The main thing that the new version brings is a totally different approach to mixed-language parsing. In the previous version, parser implementations (i.e. the LR parser and the Markdown parser) were responsible for managing nested languages inside their text. The way the LR parser did this was by eagerly scanning for some end token (say
</script>) when it saw the start of a nested range, and then running the inner parser on the text up to that end token. That worked well for HTML and similar simple cases, but not for much else, and it also required a full re-scan of the nested range (which might be, say, a 2MB
<style> tag) every time something in it changed.
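That old strategy can be sketched roughly like this (a toy model with made-up helper names, not Lezer’s actual internals): on seeing the opening of a nested range, scan forward for a literal end token and hand everything before it to the inner parser.

```typescript
// Toy model of the pre-0.15 eager-scan approach (names are made up).
// On entering a nested range, scan ahead for a literal end token and
// treat everything before it as the inner language's input.
function findNestedEnd(doc: string, from: number, endToken: string): number {
  const idx = doc.indexOf(endToken, from);
  return idx < 0 ? doc.length : idx; // unterminated: range runs to end of input
}

function nestedRange(doc: string, from: number, endToken: string) {
  const to = findNestedEnd(doc, from, endToken);
  // The inner parser was run on this slice, and any change inside it
  // forced a full re-scan of the whole range.
  return { from, to, text: doc.slice(from, to) };
}
```

Note how a change anywhere inside the slice invalidates all of it, which is what made editing inside big nested regions expensive.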
But in lots of situations, such as templating languages, or even Markdown code blocks with blockquote
> markers inside of them, what you need is to run another parser on some complicated set of subranges of the outer parse.
The main problem with this is that it becomes really tricky to represent the resulting parse as a tree. For example, in the hypothetical template snippet
<div>a<?if x?></div><div>b<?/if?></div>, the structures of the two languages don’t nest, so there’s no single hierarchical tree that can represent that document.
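To make the non-nesting concrete, here is a small check (the offsets are computed by hand for the snippet above, and the names are purely illustrative): the template’s If construct and the first HTML element partially overlap without either containing the other, which is exactly what a single tree cannot express.

```typescript
type Span = { name: string; from: number; to: number };

// Two spans "cross" when they overlap but neither contains the other:
// the configuration a single hierarchical tree cannot represent.
function cross(a: Span, b: Span): boolean {
  const overlap = a.from < b.to && b.from < a.to;
  const aInB = b.from <= a.from && a.to <= b.to;
  const bInA = a.from <= b.from && b.to <= a.to;
  return overlap && !aInB && !bInA;
}

// Offsets into "<div>a<?if x?></div><div>b<?/if?></div>"
const ifNode: Span = { name: "If", from: 6, to: 33 };        // <?if x?> … <?/if?>
const firstDiv: Span = { name: "Element", from: 0, to: 20 }; // <div>a<?if x?></div>
```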
I ended up with the slightly awkward but generally workable concept of ‘mounted’ trees, which allows other-language trees to be attached to the nodes of an outer tree, either replacing the entire node they are mounted on, such as a JS script replacing the text inside a
<script> tag, or overlaying it, which creates a situation where there are two parallel active trees inside the mount target—the original tree itself, and, for some of the ranges inside it, the overlay tree.
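A rough data model for mounts (purely illustrative, not the actual @lezer/common types) might look like this, with the replace/overlay distinction encoded in whether a range list is present:

```typescript
type InnerTree = { language: string };

// A mount attaches an inner-language tree to a node of the outer tree.
// overlay == null: the inner tree replaces the node's content outright.
// overlay != null: the inner tree applies only to the listed ranges,
// while the outer node's own structure stays active everywhere.
interface Mount {
  tree: InnerTree;
  overlay: { from: number; to: number }[] | null;
}

// Which trees are "live" at a given position inside the mount target?
function activeTrees(mount: Mount, pos: number): "inner" | "outer" | "both" {
  if (!mount.overlay) return "inner";
  const inRange = mount.overlay.some(r => r.from <= pos && pos < r.to);
  return inRange ? "both" : "outer";
}
```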
So for the templating example above, you’d parse the document using the template language, getting an output like
Template(Content,If(IfTag,Content,EndIfTag),Content). Then you’d parse all the regions covered by
Content nodes as HTML, and attach the resulting tree to the
Template root along with information about the ranges it covers. (For something like a JS template literal, you could do something similar, but target only the nodes that hold the literal’s textual content.)
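The “collect the ranges, then attach” step could be sketched like this (using a toy tree shape, not Lezer’s actual Tree class, and hand-computed offsets for the example snippet):

```typescript
interface Node { name: string; from: number; to: number; children: Node[] }

// Gather the ranges of every node with the given name; these become
// the overlay ranges handed to the inner (HTML) parse.
function overlayRanges(node: Node, name: string): { from: number; to: number }[] {
  const out: { from: number; to: number }[] = [];
  const walk = (n: Node) => {
    if (n.name === name) out.push({ from: n.from, to: n.to });
    n.children.forEach(walk);
  };
  walk(node);
  return out;
}

// Toy parse of the example: Template(Content,If(IfTag,Content,EndIfTag),Content)
const tree: Node = {
  name: "Template", from: 0, to: 39, children: [
    { name: "Content", from: 0, to: 6, children: [] },    // <div>a
    { name: "If", from: 6, to: 33, children: [
      { name: "IfTag", from: 6, to: 14, children: [] },   // <?if x?>
      { name: "Content", from: 14, to: 26, children: [] }, // </div><div>b
      { name: "EndIfTag", from: 26, to: 33, children: [] }, // <?/if?>
    ]},
    { name: "Content", from: 33, to: 39, children: [] },  // </div>
  ],
};
```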
Having both the ‘root’ and the ‘mounted’ tree preserved also has advantages for incremental parsing. For nested regions that might be large, grammars can emit some kind of repeat structure in their tree (for example by making each line a token, rather than the whole region a single token), which can then be used when re-parsing to quickly recreate the node that covers the region without re-scanning the entire thing.
Instead of making each parser implementation responsible for handling its nesting, Lezer 0.15.0 defers mixed-language parsing to the end of the parse, when the tree is available. This makes the partial parse interface slightly less straightforward, since the parse no longer proceeds in a single pass, but removes a whole lot of awkward coupling and allows mixed parsing to be defined in terms of tree nodes, rather than living in some strange space during the parse, where surrounding nodes aren’t available yet.
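The shape of the new mechanism, reduced to a toy (in the released packages this role is played by @lezer/common’s parseMixed wrapper, whose details differ): after the outer parse finishes, walk the tree, ask a user callback which nodes get an inner parser, and schedule those inner parses.

```typescript
interface Node { name: string; from: number; to: number; children: Node[] }
// What to nest in a node; overlay ranges would restrict the inner parse.
interface NestSpec { parserName: string; overlay?: { from: number; to: number }[] }

// The key change: the callback sees finished nodes, with their
// surroundings already parsed, instead of running mid-parse.
function planMixedParses(
  tree: Node,
  nest: (node: Node) => NestSpec | null
): { node: Node; spec: NestSpec }[] {
  const plans: { node: Node; spec: NestSpec }[] = [];
  const walk = (n: Node) => {
    const spec = nest(n);
    if (spec) plans.push({ node: n, spec });
    n.children.forEach(walk);
  };
  walk(tree);
  return plans;
}
```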
Since the
lezer-tree package has, for a while, also exposed abstract types related to parsing, not just trees, and
lezer is actually just the LR parser implementation, I decided to use this breaking change to rename them: lezer-tree is now @lezer/common, and the lezer package itself is @lezer/lr.
I also went ahead and moved to a
@lezer package scope, to align it with what CodeMirror 6 is doing and for general aesthetic pleasantness.
The lezer-[language] packages are now @lezer/[language].
There was a rather insidious bug around incremental parsing and tokenizer lookahead. Essentially, if a tokenizer looked ahead beyond the token that it eventually produced (for example a block comment tokenizer giving up when not finding the closing
*/ marker, and then ending up tokenizing a division operator), that created a dependency on all parts of the input that the tokenizer looked at, and this was not properly tracked. When a later document update added the closing marker, an incremental re-parse might end up reusing the division operator and produce the wrong output.
Version 0.15.0 properly tracks lookahead, but in order to do that it had to overhaul the way external tokenizers are written, both to make lookahead observable and to keep the API from inviting untracked lookahead. Thus the input stream abstraction passed to tokenizers changed entirely, and you’ll have to adjust your external tokenizer code to it.
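Both the bug and the fix can be modeled with a toy stream that records how far the tokenizer peeked (again illustrative; the real @lezer/lr InputStream interface differs): the lookahead distance is stored alongside the token, and reuse is refused whenever an edit falls inside the looked-at region, not just inside the token itself.

```typescript
// Toy input stream that tracks the furthest position the tokenizer read.
class TrackingStream {
  maxRead = 0;
  constructor(private doc: string, public pos: number) { this.maxRead = pos; }
  peek(offset: number): string {
    // Probing EOF also counts as reading that position, so an append
    // at the end of the document invalidates the token too.
    this.maxRead = Math.max(this.maxRead, this.pos + offset + 1);
    return this.doc[this.pos + offset] ?? "";
  }
}

// Block-comment tokenizer: on "/*" it scans for "*/"; if that never
// appears it falls back to a plain "/" token, but by then it has read
// to the end of the input, and that lookahead must be remembered.
function tokenizeSlash(doc: string, pos: number) {
  const s = new TrackingStream(doc, pos);
  if (s.peek(0) === "/" && s.peek(1) === "*") {
    for (let i = 2; ; i++) {
      if (s.peek(i) === "") break; // hit EOF without finding "*/"
      if (s.peek(i) === "*" && s.peek(i + 1) === "/")
        return { type: "BlockComment", from: pos, to: pos + i + 2, lookahead: s.maxRead };
    }
  }
  return { type: "Divide", from: pos, to: pos + 1, lookahead: s.maxRead };
}

// A token may be reused across an edit only if the edit didn't touch
// anything the tokenizer looked at, not just the token's own extent.
function canReuse(token: { from: number; lookahead: number }, editAt: number): boolean {
  return editAt < token.from || editAt >= token.lookahead;
}
```

With this bookkeeping, inserting the closing `*/` after an unterminated `/*` falls inside the stored lookahead region, so the stale division-operator token cannot be reused.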