Some thoughts (and requests) about how CM6 schedules parsing

I’ve been working on a complete rewrite of my cm6-monarch extension, and it differs to the point I don’t feel like it should really be called monarch anymore. Regardless, this experience has made me desire some additions to CM6 / Lezer.

Just to give an idea of my architecture / use-case here, since it’s relevant later (and maybe interesting to you), here are some key points:

  • The parser runs a tokenizer -> parser loop. The tokenizer is given a string and some positional data, and it returns a list of tokens. The tokens themselves have parser “directives” attached to them, which inform the parser about nesting. It’s similar to a TextMate grammar - except in reverse. A node is only valid when it closes, rather than eagerly matched when it opens.

  • The parser outputs tokens into a flat “buffer”, like how I believe Lezer does, although in this case there is only a single “stack”, as there is only ever one route the tokenizer/parser can take. This buffer is found by association with the current syntax tree, like how the StreamLanguage does it.

  • The buffer consists of two kinds of objects: many BufferTokens, and an even distribution of Checkpoints.

  • All BufferTokens are document-positionally relative to the last Checkpoint - and all Checkpoints are relative to the previous Checkpoint, with the exception of the very first Checkpoint.

What that last point means is that subtly adjusting the buffer - which is the source of the data that is ultimately used in the final Tree - is really easy. You can simply move a relevant Checkpoint, and all Checkpoints in front of it move forwards/backwards with it.
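
As a rough sketch of the shape (illustrative TypeScript, not my actual code):

interface Checkpoint {
  offset: number        // distance from the previous Checkpoint (0 for the first)
  tokens: BufferToken[] // tokens positioned relative to this Checkpoint
}

interface BufferToken {
  type: number
  from: number // relative to the owning Checkpoint
  to: number
}

// After an edit, shifting one Checkpoint's offset implicitly moves every
// later Checkpoint (and all of their tokens) by the same amount, since
// absolute positions are just running sums of the offsets.
function shiftFrom(checkpoints: Checkpoint[], index: number, delta: number) {
  checkpoints[index].offset += delta
}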

Making this system has made me want two things in particular:

  1. A way to asynchronously parse. This can either take the form of an actual plain old Promise, or some sort of function that allows a parser extension to inform CM6 that it wants the parse worker to schedule another parse some time in the future.
  2. A way to tell Tree.build that a node is to be “skipped”, or “filler”. This is so that my flat buffer doesn’t have to be carefully rebuilt if I “compress” it. What I mean by that is that a BufferToken can have a Tree property, and if it’s there, it will be output as a reused node. The issue arises from the fact that doing this should cause all of its children to not be output - but preserving those children nodes is useful for incremental parsing. If I could output them, and thus preserve the size property of all of the reused node’s parents, but still not have the children actually output by Tree.build, that would be fantastic.
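
To illustrate the second request (entirely hypothetical - nothing like this exists in Lezer today), the kind of buffer entry I have in mind would look roughly like:

import type { Tree } from "lezer-tree"

// Hypothetical shape. If `tree` is set, the entry is emitted as a reused
// node; entries marked `filler` would stay in the buffer (so parent sizes
// and incremental reuse keep working) but Tree.build would not emit them.
interface ReusableBufferToken {
  type: number
  from: number
  to: number
  tree?: Tree
  filler?: boolean
}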

I’ll elaborate on the first one a bit more. I have a few use cases for it:

  • Scheduling a reparse of a nested language region once the language’s async load finishes.
  • Doing some really expensive action - say, using a bunch of requestIdleCallback calls to compress a tree and get a bunch of reusable nodes without having to do it during a time-sensitive parse.
  • And of course, just plain ol’ async parsing. Maybe it’s something that works with a web worker, or maybe it’s a “fine” async parse vs. a coarse sync parse, like a detailed semantic tokens pass.

I know this introduces issues with how parsing currently works. It’s quite handy that parsing is synchronous for everything interfacing with CM6. I think this should be preserved! A synchronous parse should always be supported in some sense - if a bit of code calls syntaxTree, the parser should have some sort of output when it’s forced to finish synchronously.

I think that latter situation could be helped with an “urgent” property on the parse context object. Or maybe simply exposing what the timeout for the parser is, e.g. 25 milliseconds.

Oh, and on the second request - I did try making a custom BufferCursor class for it, but I just disliked how messy the position index reporting got. Instead, I just process the whole buffer into a valid Tree.build node list beforehand.

Finally, man CM6 is fast, regardless of what I throw at it. What sorcery is this!?

Hi. Thanks for the feedback. Could you elaborate a bit on how your desired asynchronous parsing differs from the current automatically scheduled calling of parser.advance by the parse plugin? Do you want to do the scheduling yourself and call a callback when it is done? That would be possible to add, I guess, but it’d push a bunch of complexity into your code, such as when to update the editor state’s tree (which currently happens automatically when parsing reaches the end of the viewport, for example), when to stop doing work, and when to abort and restart because of incoming state changes (and, in that case, how to reuse partial work).

One approach would be to provide a more primitive version of the Language class which doesn’t load the parse plugin and requires you to implement your custom scheduling logic. That’d not be trivial, but I guess it might be helpful in some cases. (I didn’t quite get how your incremental parsing works, but I guess it doesn’t fit the mold of Lezer parsing.)

Mostly an unreasonable insistence on keeping the complexity of everything possible sublinear to document size (and sacrificing some convenience for that goal). I’m sure we’ll find corner cases where it’s slow still.

Okay, I don’t think I was very clear and I apologize. I think it’s less about needing control of the scheduler, and more about supporting “hey, I gave you a tree, but I wasn’t really done” or something like that.

I think CM6 uses a while (time < deadline) sorta system when it’s calling parser.advance. This doesn’t really mesh well with anything async (like a timeout, or a Promise), and I honestly don’t see any satisfactory way of doing something async that doesn’t involve spooky workarounds. The only workaround I can think of is to start a Promise and wait for it to finish by repeatedly returning null from advance until it is done. I don’t really like this though, because doesn’t that just block the main thread with a while (true) loop, defeating the whole point of doing something async?
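
Roughly the shape of that workaround (just a sketch - startAsyncWork is a stand-in, not a real API):

import type { Tree } from "lezer-tree"

declare function startAsyncWork(): Promise<Tree>

let result: Tree | null = null
let started = false

// Stand-in for PartialParse.advance. Within a single work slice, the
// while (time < deadline) loop keeps calling this synchronously, so the
// promise can't even settle until the slice ends - the async work buys
// almost nothing.
function advance(): Tree | null {
  if (!started) {
    started = true
    startAsyncWork().then(tree => { result = tree })
  }
  return result
}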

One idea I had to remedy this is to leverage the existing scheduler functionality, by adding a new public function to the EditorParseContext. Something like this:

type ParseSchedule = (when: number /* ms */ | Promise<unknown>, blocking?: boolean) => void

I would think any calls to this would get deferred until the current parse completes, so you can’t have begging-for-bugs overlapping parses happening.

I don’t think there would be anything magical about this function - it would basically describe “I want you to act like the tree needs to be updated in (x) time”. I think this is all that’s needed to support a background task (like loading a language) and force a new tree when it’s done.
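
Usage would look something like this (purely hypothetical - neither scheduleParse nor loadNestedLanguage are real APIs, just stand-ins for the idea):

declare const context: { scheduleParse: ParseSchedule }
declare function loadNestedLanguage(): Promise<unknown>

// "Act like the tree needs to be updated in 500ms."
context.scheduleParse(500)

// Reparse once a nested language finishes loading, and hold off on any
// other reparse until that promise resolves.
context.scheduleParse(loadNestedLanguage(), true)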

However, for some use cases, e.g. a web worker parser, this system would be really awkward to use. I think CM6 would need to support a completely Promise-based PartialParse sorta interface, which would be nice to have, but not needed for my purposes right now.

EDIT: Just to give an example of a practical and compelling web worker parser: it’s totally possible to leverage ArrayBuffer transferring and some other tricks to get surprisingly low latency with a web worker. If you are parsing some really expensive language, say The Dark One (YAML), you can abuse the thread the web worker gets, avoid annihilating the user experience, and still get good latency to boot. The issue is that this requires a lot of boilerplate and gluing - and of course, having to do everything async.
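
The transfer mechanics themselves are simple - something like this, where parseToFlatBuffer stands in for whatever the worker-side parser produces:

// worker.ts - parse into a flat Int32Array, then transfer the underlying
// ArrayBuffer instead of copying it (worker-global postMessage).
declare function parseToFlatBuffer(doc: string): Int32Array

onmessage = (event: MessageEvent<string>) => {
  const flat = parseToFlatBuffer(event.data)
  postMessage(flat, [flat.buffer])
}

// main.ts - the received Int32Array views the transferred memory directly,
// so neither side paid for a copy.
declare const editorDocText: string

const worker = new Worker("worker.js")
worker.onmessage = (event: MessageEvent<Int32Array>) => {
  // assemble the editor-side tree from the flat chunks in event.data
}
worker.postMessage(editorDocText)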

I’ve looked into parsing in web workers, but the kind of shared-structure trees that Lezer uses fit very poorly with this—in order to share structure, they have to be a tree of objects, not just a big flat array, and moving those through web worker boundaries is a serious pain. Also, the whole asynchronicity and coordination logic quickly gets complicated. So the fact that CodeMirror uses main-thread small-increment synchronous parsing is an intentional, well-considered decision.

For dynamically loading parsers on demand, there’s the getSkippingParser utility, which is used by the Markdown mode to temporarily skip a language it can’t handle, and restart parsing after it finishes loading.

I still don’t really get how your proposed interface would work. What does blocking mean in this case? And what, precisely, happens when you call this function from your parser?

Basically, the way getSkippingParser works is pretty much what I meant - not sure if that was intentional. However, what I was imagining is closer to scheduleOn being public, rather than privately used by the skipping parser. A public version of scheduleOn would probably be some kind of function that combines every given Promise into a Promise.all arrangement.

What does blocking mean in this case?

I figured a somewhat common issue with a “background task” would be that the task could potentially conflict with a reparse firing while the task is still working. blocking would tell the scheduler to stay “busy” and not let another reparse happen until the when Promise is complete.
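
Roughly the semantics I’m imagining on the scheduler side (a hypothetical sketch, not how CM6 actually works - requestReparse is a stand-in for poking the parse worker):

declare function requestReparse(): void

const scheduled: Promise<unknown>[] = []
let blocked = false

function scheduleParse(when: number | Promise<unknown>, blocking = false) {
  const promise = typeof when === "number"
    ? new Promise(resolve => setTimeout(resolve, when))
    : when
  scheduled.push(promise)
  if (blocking) blocked = true // the parse worker would treat "blocked" as busy
  // Fold everything scheduled so far into one Promise.all; when it settles,
  // unblock and ask for a reparse.
  Promise.all(scheduled).then(() => {
    blocked = false
    requestReparse()
  })
}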

Totally aware of how bad it is - which is why the only sane strategy I can think of is a highly optimized ArrayBuffer transfer pipeline. The actual “tree” would be assembled from flat chunks sent from the web worker.

Also totally aware! I think any async parsing solution would be focused on the idea of only allowing async when it’s “safe.” For example, if the document was being async parsed and an extension suddenly requested a syntax tree, the parser would have to somehow give a tree (basically forceFinish) that instant.

Basically, small-increment async parsing, so that there is a synchronously available tree if needed.

I’m still not really seeing the possibility of a simple interface here—things like not conflicting with other parses and eagerly fetching the tree would require a whole bunch of orchestration. Could you say a bit more about your concrete use case here? (I assume the stuff about parsing in workers was more of a future-possibility scenario?)

Sorry if I haven’t been very clear - it’s also been difficult to share code because my code wasn’t really working yet! I also have only a very high-level understanding of CodeMirror’s internals - mostly from my adventures in debugging.

Basically, I had two points of discussion, with the first being the scheduler function thingy - and the other being a nebulous idea for how to handle an async parser. With the former, I’m mostly satisfied with the getSkippingParser change (I’ve already implemented it myself!). With the latter, it was entirely a future-possibility scenario. However, I can actually give the real and utterly frustrating use case that made me want it.

I’m creating some really nice language support for a markup language that has a lot of questionable features. This markup language is… uh, it goes by multiple names - Wikitext, Text_Wiki (I think), some others - and it was made/designed by the Wikidot devs for their wiki system.

There is a slow fork/lumbering-rewrite of Wikidot underway, and we need to preserve backwards compatibility. I also want to make writing articles actually like, not terrible. As part of the project, there was an entirely new and shiny parser made (FTML is the name) in Rust, complete with diagnostics, AST representation, etc. It can be compiled to WASM.

The immediate idea was to use the parser directly - but it isn’t incremental and converting the resultant AST would probably be costly. One way around this might be to throw optimization to the wind and get a result as fast as possible from a web worker. This ultimately didn’t work out, as I needed a more detailed tree for the editor, so I’ve been writing my own parsing system that can handle the absurdity of Wikitext. Regardless, I could absolutely see a parser being reused like this in some other application. A WASM parser doesn’t have to be async, but it totally could be.

Oh, also - I really, really tried to get a parser for this dumb language made in Lezer - with plans to use a lot of external tokenizers. But oh my god, Wikitext, it’s just… so bad. Behold:

[[ul]]
 [[li class="item1" data-toggle="data1"]]Item1[[/li]]
 [[li_ style="color: red;"]]Item 2
  [[ol]]
    [[li_]]Item 2.1[[/li]]
    [[li_]]Item 2.2[[/li]]
  [[/ol]]
 [[/li]]
[[/ul]]

[[image image-source attribute1="value1" attribute2="value2"]]
[[image http://www.example.com/image.jpg attribute1="value1" attribute2="value2"]]
[[image exampleimage.jpg attribute1="value1" attribute2="value2"]]
[[image :first attribute1="value1" attribute2="value2"]]
[[image /another-page/exampleimage.jpg attribute1="value1" attribute2="value2"]]
[[image flickr:83001279 attribute1="value1" attribute2="value2"]]
[[image flickr:149666562_debab08866 attribute1="value1" attribute2="value2"]]

[[=image http://www.example.com/image.jpg attribute1="value1" attribute2="value2"]]
[[<image http://www.example.com/image.jpg attribute1="value1" attribute2="value2"]]
[[>image http://www.example.com/image.jpg attribute1="value1" attribute2="value2"]]
[[f<image http://www.example.com/image.jpg attribute1="value1" attribute2="value2"]]
[[f>image http://www.example.com/image.jpg attribute1="value1" attribute2="value2"]]

[[div class="foldable-list-container"]]
* Links
 * Wikidot
  * [*http://www.wikidot.com/doc Documentation]
  * [*http://www.wikidot.com/doc:wiki-syntax wiki-syntax]
  * [*http://community.wikidot.com/howto:howto-list How-To's]
 * Search Engines
  * [*http://www.google.com Google]
  * [*http://www.yahoo.com Yahoo]
* Main Category 1
 * [# Main 1 - Sub 1]
 * [# Main 1 - Sub 2]
* Main Category 2
 * [# Main 2 - Sub 1]
 * [# Main 2 - Sub 2]
 * [# Main 2 - Sub 3]
* Main Category 3
 * [# Main 3 - Sub 1]
[[/div]]

I hope you can see just how hard this could be for an LR or GLR parser to handle. It’s worse than you think, too - those [link] blocks are only valid depending on special rules regarding their contents; if a block turns out to be invalid, it’s just normal text. Oh, and if you hadn’t noticed - the name of a [[block]] block determines whether it has an ending node or not. I cannot describe to you how much I hate this.

Also sometimes the blocks can have nested content in their start nodes. Like, I mean content, like markup. Just in there. You have to nest the entire language for it to work.

So, I’m making an extremely flexible regex-based parsing system that is similar to TextMate, except not awful. It’s always multi-line, has separate tokenizer and parser states/stacks (the parser lazily completes nesting nodes), supports lookahead and lookbehind, allows for optimized codepoint matching (when a regex isn’t needed), supports actual context tracking rather than string madness, etc. All incremental! Here is a small sample of a grammar definition (the syntax can get a lot crazier than this):

[screenshot: sample grammar definition]

The weird @BR node types are a special shorthand for quickly referencing pre-made delimiter pairs.

I don’t actually know what this system can’t parse. It wasn’t shown in the screenshots above, but the grammar can even support conditional logic and inline functions as rules.

I’ve put in a lot of work to make this fast. I was always concerned about performance, especially considering the complexity, and my hope was that I could somehow use a web worker if everything went to crap. So far, I haven’t needed that escape hatch.

EDIT: Just to show what the resultant tree looks like, considering the input grammar:
[screenshots: the resulting syntax tree for the sample grammar]

Bracket pairs automatically have their openedBy and closedBy props set.

Okay, seems you’re going in a whole bunch of directions. Parsing a language like that with an LR parser is, indeed, probably hopeless. See also the entirely custom parser I wrote for Markdown. I’m not too optimistic about the ‘run a non-incremental parser on each change’ approach—that gets hard to scale as the document gets big. So again, I’m willing to provide a stripped-down base Language class that makes client code responsible for all aspects of parse scheduling, if you want to try going that way.

I do want to make clear that the current interface made available to parsers is pretty damn flexible as it is. It’s just the one constraint of it being entirely sync that limits some options. I really enjoy how CodeMirror just kinda handles a lot of chores without me having to do anything.

Regardless, if an alternative scheduler could be provided, what constraints would that be under? Like, could I make it async? Would this be really bad for extensions that expect a tree immediately? Does it matter that they expect a tree immediately, if the scheduler causes an update event for them anyways when it eventually completes?

My hesitation is that I’d only be okay supporting wacky DIY parse scheduling if it isn’t going to break extensions. I expect a lot of really cool extensions to be made for CM6 and I don’t want to introduce “compatibility” issues.

I don’t know. It’d have to implement the Language interface somehow, but I haven’t thought about what that would look like with asynchronous parsing.