Efficient reuse of productions of the parse tree

grayen · February 21, 2021, 3:04pm

I have implemented my own lightweight markup language and want to give it full and proper language support in editors. At the moment I want support for VS Code, ProseMirror, and CodeMirror. So I went and manually (i.e. no generator used) implemented the language in Lezer to produce a parse tree. From there I can transform it to whatever I need, e.g. in the case of VS Code that could be semantic tokens (i.e. syntax highlighting) and folding ranges. However I started to see a pattern. If I go about naively implemented those features, I can just traverse the full parse tree and produce whatever I need, whenever it is requested. Then I realized you can combine these traversals into one traversal, using async/promises. And then I realized that in most cases only a small change occurred, and I know exactly what part of the parse tree changed in turn, so it is rather inefficient to reproduce the whole thing. In case of big documents, this could become significant. This holds for parsing itself (in case of Lezer we can use TreeFragments), but also for the semantic tokens, folding ranges, document links, etc. Unfortunately, as I noticed with implementing reuse in the parser, it is definitely non-trivial to get it correct, making sure everything is updated exactly as it would be as when a full reparse would have taken place. I wonder how others approach this. When I think of all corrections I would need to do to properly reuse, I feel like a full reproduction might sometimes actually be faster outside maybe memory usage (i.e. just recalculate all folding ranges, even if only one is affected).

At the moment I am just going to only do full reproductions (but combine them in one traversal), but still reuse the parse tree where possible with TreeFragments. However I would like to go the extra step and learn of ways how to have those reproductions be recalculated more efficiently. If anybody has examples of how others do this, or other pointers, I love to hear them!

marijn · February 21, 2021, 3:22pm

It’s also been my experience that this stuff is tricky. For highlighting, I concluded that just re-generating the decorations for the whole viewport whenever something relevant changes was easier than tracking unchanged regions and updating the existing decorations minimally. But I’m sure there are situations, such as those that require one to work from the entire tree instead of just the part that touches the viewport, where doing things incrementally would be necessary. One possible trick there, if your result value can be calculated separately for (some) nodes, and the computation is structured as a reduction over such values for child nodes, might be to keep a weakmap from Tree instances to result values to memoize the work.

grayen · February 21, 2021, 4:25pm

Thanks for the quick response! Its good (and bad, as I would have actually preferred feeling stupid for missing something obvious ) to hear it confirmed that its not already considered a solved problem. I also considered using WeakMap, considering your use of it in your Markdown implementation for Lezer, so I guess I will try and see how far I can get using that. It will probably still remain somewhat tricky to really generalize this, as different calculations might need to consider different scopes and such.