Nested parsers without end marker in SimpleMode

cesar · January 7, 2021, 3:45pm

Apologies if this has already been asked before, but my use case is rather unique and I couldn’t find anything exactly like what I’m trying to do.

I have a SimpleMode that wraps other language modes, using several rules like the following:

{regex: /^%py.*$/, token: "meta", sol: true, mode: {spec: "text/x-python", end: /^(?=%)/}}

Basically, sub-languages are delineated by a %lang marker (or similar) at the start of a line.
There’s no explicit end marker; the “end” of a block is simply the next %lang marker at the start of a line.
Unfortunately there’s no “sol” option for the end marker, so the lexer is currently broken.
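To illustrate why the missing “sol” check matters, here is a small sketch in plain JavaScript. It assumes the end pattern is tested against the remainder of the line at the current token position (which is how StringStream.match treats a regex); the helper name endMatchesAt is made up for this example.

```javascript
// The end pattern from the rule above: a zero-width lookahead for "%".
const END = /^(?=%)/;

// Hypothetical helper: mimics testing a pattern against the rest of the line
// starting at `pos`, so "^" anchors at `pos`, not at the start of the line.
function endMatchesAt(line, pos) {
  return END.test(line.slice(pos));
}

const line = "total = count %size";
console.log(endMatchesAt(line, 0));                 // false
console.log(endMatchesAt(line, line.indexOf("%"))); // true, although "%" is mid-line
```

So without a start-of-line condition, a “%” that begins a token anywhere in the nested code could be taken for the end marker.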

It turns out the problem is actually a bit more complicated, though.
Some markers are special and don’t start a nested mode. Perhaps an example helps:

%py
print("Hello World")
%clear
print("Hello World")  # still python
%js
console.log("Hello World")

For this to work, the next %marker shouldn’t necessarily pop back to the previous state; markers should be able to keep stacking indefinitely.

If I controlled every language, I could simply have each one switch back to the top-level SimpleMode whenever it finds a %marker at the start of a line, but the idea is that the languages might be existing ones like Python or JavaScript.

Is there a way to accomplish this? Perhaps by wrapping every sub language in an overlay mode?

This work is being done in the context of JupyterLab which I assume will eventually switch to CodeMirror 6 so I’d rather not commit to writing a complicated custom CodeMirror 5 solution that will have to be replaced later on.

marijn · January 7, 2021, 4:05pm

A directly written mode (without simple-mode) should be able to handle this just fine—tokenizing inner modes goes through the outer mode, so it can always interfere there if it needs to. This doesn’t have to be very complicated.
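For readers who land here later, a minimal sketch of what such a hand-written CodeMirror 5 mode might look like. This is not from the thread: the MARKER_SPECS table and the makeMarkerMode name are made up for illustration, and any marker not in the table (such as %clear) is assumed to leave the current language unchanged.

```javascript
// Hypothetical mapping from %marker names to CodeMirror 5 MIME types.
const MARKER_SPECS = { py: "text/x-python", js: "text/javascript" };

// Builds the outer mode object. `CodeMirror` is passed in as an argument so
// the sketch does not rely on a global.
function makeMarkerMode(CodeMirror, config) {
  const cache = {};
  function modeFor(name) {
    if (!Object.prototype.hasOwnProperty.call(MARKER_SPECS, name)) return null;
    if (!cache[name]) cache[name] = CodeMirror.getMode(config, MARKER_SPECS[name]);
    return cache[name];
  }

  return {
    startState() {
      return { inner: null, innerState: null };
    },
    copyState(state) {
      return {
        inner: state.inner,
        innerState: state.inner
          ? CodeMirror.copyState(state.inner, state.innerState)
          : null,
      };
    },
    token(stream, state) {
      // The outer mode sees every line first, so it can intercept a marker
      // at the start of a line before any inner mode gets to tokenize it.
      if (stream.sol() && stream.peek() === "%") {
        const m = stream.match(/^%(\w+)/);
        stream.skipToEnd();
        const next = m && modeFor(m[1]);
        if (next) {
          // Known language marker: switch to a fresh inner mode.
          state.inner = next;
          state.innerState = CodeMirror.startState(next);
        }
        // Unknown markers keep whatever language was active.
        return "meta";
      }
      if (!state.inner) {
        stream.skipToEnd();
        return null;
      }
      return state.inner.token(stream, state.innerState);
    },
    innerMode(state) {
      return state.inner ? { mode: state.inner, state: state.innerState } : null;
    },
  };
}
```

Registering it would then look something like CodeMirror.defineMode("markers", (config) => makeMarkerMode(CodeMirror, config)), where "markers" is again just a placeholder name. Indentation and blank-line handling are left out to keep the sketch short.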

cesar · January 7, 2021, 4:19pm

Thank you for the speedy reply! I was afraid a directly written mode would be necessary, something similar to htmlmixed I’m guessing. I tried looking at it before but found it rather confusing.

Will CodeMirror 6 need a similar approach? Or can Lezer handle this more complicated case?

marijn · January 7, 2021, 4:31pm

CodeMirror 6 currently has no way to handle a mode like this at all—Lezer also uses specific end tokens to delimit nested parses.

cesar · January 7, 2021, 4:38pm

Alright, interesting. I guess if it supports zero-width lookahead patterns (e.g. /^(?=%)/) as end tokens, it can at least support this use case, though only by going back to a default language rather than to a previous one.

SimpleMode in CodeMirror 5 would also support that if mode accepted a sol parameter, e.g. mode: {spec: ..., end: ..., sol: true}

I don’t know if you’re still making improvements to CodeMirror 5, but that would be a nice addition to have. Either way, thank you very much for your time. I’ll read up on the documentation to understand how to write a direct mode.