Lezer 0.15.0 release

This took way longer than I hoped, but I’ve just tagged version 0.15.0 of the Lezer packages that I maintain. This is an incompatible upgrade, though unless you’re maintaining a grammar with non-trivial external tokenizers or using nested parsing, it should be very easy to upgrade.

You can see the full change log on the website.

Mixed-language parsing

The main thing that the new version brings is a totally different approach to mixed-language parsing. In the previous version, parser implementations (i.e. the LR parser and the Markdown parser) were responsible for managing nested languages inside their text. The way the LR parser did this was by eagerly scanning for some end token (say </script>) when it saw the start of a nested range, and then running the inner parser on the text up to that end token. That worked well for HTML and similar simple cases, but not for much else, and also required a full re-scan of the nested range (which might be, say, a 2mb <style> tag) every time something in it changed.

But in lots of situations, such as templating languages, or even Markdown code blocks with blockquote > markers inside of them, what you need is to run another parser on some complicated set of subranges of the outer parse.

The main problem with this is that it becomes really tricky to represent the resulting parse as a tree. For example, in the hypothetical template snippet <div>a<?if x?></div><div>b<?/if?></div>, the structures of the two languages don’t nest, so there’s no single hierarchical tree that can represent that document.

I ended up with the slightly awkward but generally workable concept of ‘mounted’ trees, which allows other-language trees to be attached to the nodes of an outer tree, either replacing the entire node they are mounted on, such as a JS script replacing the text inside a <script> tag, or overlaying it, which creates a situation where there are two parallel active trees inside the mount target—the original tree itself, and, for some of the ranges inside it, the overlay tree.

So for the templating example above, you’d parse the document using the template language, getting an output like Template(Content,If(IfTag,Content,EndIfTag),Content). Then you’d parse all the regions covered by Content nodes as HTML, and attach the resulting tree to the Template root along with information about the ranges it covers. (For something like a JS template literal, you could do something similar, but target only the TemplateString node).
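
To make that concrete, here is a rough sketch of how such an overlay mount could be set up with the parseMixed helper from @lezer/common. The templateParser import and the Template/Content node names are assumptions taken from the example above, not an existing package:

import { parseMixed } from "@lezer/common"
import { parser as htmlParser } from "@lezer/html"
// Hypothetical parser for the template language sketched above
import { parser as templateParser } from "./template-parser.js"

const mixedParser = templateParser.configure({
  wrap: parseMixed(node => {
    // When the walk reaches the Template root, mount an HTML parse as an
    // overlay covering the ranges of all Content nodes inside it
    if (node.name == "Template")
      return { parser: htmlParser, overlay: n => n.name == "Content" }
    return null
  })
})

The callback decides, per node, whether anything should be mounted there; returning an overlay keeps the template tree intact and attaches the HTML tree alongside it for just those ranges.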

Having both the ‘root’ and the ‘mounted’ tree preserved also has advantages for incremental parsing—for nested regions that might be large, grammars can emit some kind of repeat structure in their tree (by, for example, making each line a token, rather than the whole region a single token), which can then be used when re-parsing to quickly recreate the node that covers the region, without re-scanning the entire thing.

Instead of making each parser implementation responsible for handling its nesting, Lezer 0.15.0 defers mixed-language parsing to the end of the parse, when the tree is available. This makes the partial parse interface slightly less straightforward, since the parse no longer proceeds in a single pass, but removes a whole lot of awkward coupling and allows mixed parsing to be defined in terms of tree nodes, rather than living in some strange space during the parse, where surrounding nodes aren’t available yet.

Package names

Since the lezer-tree package has, for a while, also exposed abstract types related to parsing, not just trees, and lezer is actually just the LR parser implementation, I decided to use this breaking change to rename them.

I also went ahead and moved to a @lezer package scope, to align it with what CodeMirror 6 is doing and for general aesthetic pleasantness.

  • lezer-tree is now @lezer/common
  • lezer is now @lezer/lr
  • lezer-generator is now @lezer/generator
  • The lezer-[language] packages are now @lezer/[language]
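
In practice the renames mostly just mean updating imports. Using the Python grammar as an example, something like:

// Before 0.15:
//   import { Tree } from "lezer-tree"
//   import { parser } from "lezer-python"
// With 0.15:
import { Tree } from "@lezer/common"
import { parser } from "@lezer/python"

let tree = parser.parse("x = 1")
console.log(tree instanceof Tree) // true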

Lookahead bug

There was a rather insidious bug around incremental parsing and tokenizer lookahead. Essentially, if a tokenizer looked ahead beyond the token that it eventually produced (for example a block comment tokenizer giving up when not finding the closing */ marker, and then ending up tokenizing a division operator), that created a dependency on all parts of the input that the tokenizer looked at, and this was not properly tracked. When a later document update added the closing marker, an incremental re-parse might end up reusing the division operator and produce the wrong output.

Version 0.15.0 properly tracks lookahead, but in order to do that it had to overhaul the way external tokenizers are written, both to make the lookahead they do observable and to keep the API from inviting unnecessary lookahead. Thus, the input stream abstraction passed to tokenizers changed entirely, and you’ll have to adjust your external tokenizer code accordingly.
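
To give a feel for the new interface, here is a sketch of my own (not code from the release; the BlockComment and divide term names and the terms file path are made up) of a tokenizer like the block-comment example above. It reads characters at offsets relative to the current position and reports the token it found through the stream:

import { ExternalTokenizer } from "@lezer/lr"
// Hypothetical term ids generated for this grammar
import { BlockComment, divide } from "./parser.terms.js"

const slash = 47, star = 42 // "/" and "*"

export const comments = new ExternalTokenizer(input => {
  if (input.next != slash) return
  if (input.peek(1) == star) {
    // Look ahead (offsets are relative to the current position) for "*/"
    for (let off = 2;; off++) {
      let ch = input.peek(off)
      if (ch < 0) break // unterminated comment: fall back to "/"
      if (ch == star && input.peek(off + 1) == slash) {
        input.acceptToken(BlockComment, off + 2)
        return
      }
    }
  }
  // No comment found: a lone "/" is a division operator. The reads done
  // above are the kind of lookahead the stream now keeps track of.
  input.acceptToken(divide, 1)
})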

I’m trying to update the Julia parser, which extensively uses external tokenisers, and it seems to me that the interaction between skipped tokens and external tokenisers has changed with 0.15.0?

Previously, external tokenisers had a chance to interpret the input before skip rules fired, but now it seems like skip rules go first and only then do external tokenisers get a chance to run?

The relevant fragment of the grammar is:

@skip { whitespace | Comment | BlockComment }
...
expressionList<e, pe> {
  (e ~id | BareTupleExpression<e> | AssignmentExpression<e> | FunctionAssignmentExpression)
  (!regular0 terminator
    (!regular0 e | BareTupleExpression<e> | AssignmentExpression<e> | FunctionAssignmentExpression))*
  terminator?
}
...
@external tokens terminator from "./index.tokens.js" { terminator }

The terminator tokeniser looks like this after updating to the 0.15.0 API:

export const terminator = new ExternalTokenizer((input, stack) => {
  let curr = input.peek(input.pos);
  if (curr === CHAR_NEWLINE || curr === CHAR_SEMICOLON) {
    if (stack.canShift(terms.terminator)) {
      input.acceptToken(terms.terminator, input.pos + 1);
      return;
    }
  }
});

What I see is that the terminator tokeniser now triggers at a position already past the newline (so I figured the newline had been consumed by a skip rule).

The desired behaviour is: I want newlines to be ignored everywhere except in places where I explicitly look for a terminator token.

Skip rules don’t ‘fire’ before tokenization—tokenization happens first, and skip rules only take effect if the token they start with is produced by that process. So it sounds like you are working with incorrect assumptions.

Ok, I think I misunderstood peek(offset): the offset is relative to input.pos, not to the start of the input. Sorry for the noise.
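
With that fixed, the tokeniser above becomes something like this (a sketch; the terms import path is assumed):

import { ExternalTokenizer } from "@lezer/lr";
import * as terms from "./parser.terms.js"; // assumed path for the generated terms
const CHAR_NEWLINE = 10, CHAR_SEMICOLON = 59; // "\n" and ";"

export const terminator = new ExternalTokenizer((input, stack) => {
  // input.next is the character at the current position (same as input.peek(0))
  if (input.next === CHAR_NEWLINE || input.next === CHAR_SEMICOLON) {
    if (stack.canShift(terms.terminator)) {
      // The end offset given to acceptToken is relative to the current
      // position, so 1 ends the token right after the newline/semicolon
      input.acceptToken(terms.terminator, 1);
    }
  }
});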