Syntax Highlighting Without a Language

ceramichacker · August 13, 2024, 10:17pm

Hi!

I’m working on editor support for s-expressions that have a schema.

I’ve written a stream-based, resumable tokenizer and “parser”/validator. The parser is a state machine that ingests tokens, and returns information about the tokens, or any potential errors.

Architecture

The current design is:

A state field runs the tokenizer / parser for ~100 lines and saves a resumable checkpoint of tokenization/parse/error state whenever it sees a transaction with a “trigger parse” state effect.
If the state field sees a doc change, it discards all checkpoints after the start of the edit.
On every update, if the state field hasn’t finished parsing the doc, a view plugin schedules a chain of setTimeouts, which dispatch “trigger parse” effects to the state field until parsing is done.
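
To make the design above concrete, here is a minimal, editor-independent sketch of the checkpoint bookkeeping. All names (Checkpoint, ParseProgress, parseChunk, invalidateFrom, isDone) are hypothetical, and the parser state is a placeholder string rather than the real tokenizer/parser state:

```typescript
// Hypothetical shapes; the real tokenizer/parser state is much richer than a string.
interface Checkpoint {
  line: number;        // number of lines already parsed when the checkpoint was taken
  parserState: string; // stand-in for the resumable tokenization/parse/error state
}

interface ParseProgress {
  checkpoints: Checkpoint[]; // sorted by ascending line
  totalLines: number;
}

const CHUNK_LINES = 100;

// Line reached by the newest checkpoint, or 0 if nothing has been parsed yet.
function parsedUpTo(p: ParseProgress): number {
  const last = p.checkpoints[p.checkpoints.length - 1];
  return last === undefined ? 0 : last.line;
}

function isDone(p: ParseProgress): boolean {
  return parsedUpTo(p) >= p.totalLines;
}

// What a "trigger parse" effect would do: parse one chunk, then save a checkpoint.
function parseChunk(p: ParseProgress): ParseProgress {
  if (isDone(p)) return p;
  const nextLine = Math.min(parsedUpTo(p) + CHUNK_LINES, p.totalLines);
  const next: Checkpoint = { line: nextLine, parserState: `after-line-${nextLine}` };
  return { ...p, checkpoints: [...p.checkpoints, next] };
}

// What a doc change would do: keep only checkpoints taken before the edited line.
function invalidateFrom(p: ParseProgress, editLine: number, newTotalLines: number): ParseProgress {
  return {
    checkpoints: p.checkpoints.filter(c => c.line <= editLine),
    totalLines: newTotalLines,
  };
}
```

In the editor, the view plugin would call the equivalent of parseChunk from each timeout (by dispatching the effect) until isDone reports true.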

Autocomplete, info-on-hover, and syntax highlighting are then easy to implement by requesting an ad-hoc tokenization + parse of the viewport contents starting from the closest checkpoint. This performs well, and allows sharing the expensive parsing state across all usages. It’s also nice that the core logic could be extracted out to power an LSP for other editors.
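
As a rough illustration of “starting from the closest checkpoint”, the lookup could be a binary search over checkpoints sorted by document position. The names below (PosCheckpoint, Tok, closestCheckpoint, parseViewport) and the injected resume callback are assumptions made for this sketch, not the actual implementation:

```typescript
// Hypothetical checkpoint keyed by document offset.
interface PosCheckpoint { pos: number; state: string }
interface Tok { from: number; to: number; type: string }

// Last checkpoint at or before `pos`, or null if there is none.
// Assumes `cps` is sorted by ascending `pos`.
function closestCheckpoint(cps: readonly PosCheckpoint[], pos: number): PosCheckpoint | null {
  let lo = 0;
  let hi = cps.length - 1;
  let found: PosCheckpoint | null = null;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const c = cps[mid]!;
    if (c.pos <= pos) {
      found = c;    // candidate; look for a later one
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}

// Ad-hoc parse of [from, to): resume the (injected) tokenizer from the closest checkpoint.
function parseViewport(
  cps: readonly PosCheckpoint[],
  from: number,
  to: number,
  resume: (start: PosCheckpoint, upTo: number) => Tok[],
): Tok[] {
  const start = closestCheckpoint(cps, from) ?? { pos: 0, state: "initial" };
  return resume(start, to);
}
```

The same lookup can back highlighting, completion, and hover, so only the tokens between the checkpoint and the end of the viewport are recomputed per request.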

I’m running into an issue with syntax highlighting, because I haven’t found an API that would allow me to colorize tokens via the “current” theme. It’s possible to get a list of all the theme classes, but I can’t find an API to get the current highlighter, or the current tokenTable, both of which I would need to map a string through the tokenTable to a classname.

The “correct” way to solve this would be to create a CodeMirror Language, but that is a bit tricky in this particular case:

CodeMirror’s Language Tools

Compiling an s-expression schema to a Lezer grammar would be very difficult, and not very performant, since the grammar couldn’t be pre-compiled and would have to be built at runtime.

A CodeMirror stream parser is also tricky, because there exist ambiguous states where we need additional tokens to figure out the “kind” of a token. If a line ends in such an ambiguous state, we have no way of revising the previously emitted tokens.

A simple example is “option” types. (Some x) and (x) both map to Some x, so if we are in a string option, and we’ve read (Some, we don’t know if Some is a label, or a string payload.

Another example is “unions”. For instance, a schema could be a union of (String * Int * Float) | (Int * String * Bool). If we had (1 1, we don’t know which token types to emit until we reach a point where at most one of the union branches is still possible.
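
The union case can be sketched as filtering branches while atoms arrive. The atom rules in accepts are assumptions for illustration only (for example, that any atom is a valid String and that Bool is spelled true/false); the real schema rules may differ:

```typescript
type Kind = "String" | "Int" | "Float" | "Bool";

// Assumed atom rules, for illustration only.
function accepts(kind: Kind, atom: string): boolean {
  switch (kind) {
    case "String": return true; // any atom can be read as a string
    case "Int": return /^-?\d+$/.test(atom);
    case "Float": return /^-?\d+(\.\d+)?$/.test(atom);
    case "Bool": return atom === "true" || atom === "false";
  }
}

// Branches of the union that are still compatible with the atoms read so far.
function viableBranches(branches: readonly Kind[][], atoms: readonly string[]): Kind[][] {
  return branches.filter(
    branch =>
      atoms.length <= branch.length &&
      atoms.every((atom, i) => accepts(branch[i]!, atom)),
  );
}

const union: Kind[][] = [
  ["String", "Int", "Float"],
  ["Int", "String", "Bool"],
];

// After "(1 1" both branches are still viable, so the first atom may be a String or an Int.
const afterTwoAtoms = viableBranches(union, ["1", "1"]);
// A third atom settles it: "true" is not a Float, so only the second branch survives.
const afterThreeAtoms = viableBranches(union, ["1", "1", "true"]);
```

Token types for the first two atoms can only be emitted once the list of viable branches shrinks to one (or to zero, which is an error), and that may happen on a later line than the atoms themselves.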

These schemas are rare, but they’re possible.

Also, it doesn’t look like there’s a way to share work between the parsing we need to do for linting, autocomplete, and info-on-hover, and the parsing we need to do for syntax highlighting. This seems like it would also be the case for a completely custom Parser.

APIs that Might Solve This Use Case

Maybe the most convenient tool for this niche use case would allow declaring tokens as a function of a state field and the current viewport:

provideViewportTokens : EditorState -> viewportStart:int -> viewportEnd:int -> (start:int, end:int, tokenType:string) list

A more refined version of this would probably account for parsing state, and allow reacting to transactions, and scheduling work.
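
For illustration, the signature above might be rendered in TypeScript roughly as follows. This is a hypothetical shape, not an existing CodeMirror API; the State type parameter stands in for EditorState so the sketch stays self-contained, and ToyState is made up:

```typescript
// Hypothetical rendering of the proposed signature.
interface ViewportToken { start: number; end: number; tokenType: string }

type ProvideViewportTokens<State> =
  (state: State, viewportStart: number, viewportEnd: number) => ViewportToken[];

// Toy state: tokens already computed by the extension's own parser.
interface ToyState { tokens: ViewportToken[] }

// Return the tokens that overlap the viewport, clipped to its bounds.
const provideFromToyState: ProvideViewportTokens<ToyState> = (state, viewportStart, viewportEnd) =>
  state.tokens
    .filter(t => t.end > viewportStart && t.start < viewportEnd)
    .map(t => ({
      ...t,
      start: Math.max(t.start, viewportStart),
      end: Math.min(t.end, viewportEnd),
    }));
```

The editor would then be responsible only for turning each tokenType into theme colors, while the extension keeps full control over how and when the tokens are computed.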

Effectively, this would be a lower-level version of Language where constructing and updating the parsetree is managed entirely by the extension.

Beyond this use case, such a system would make it much easier to power syntax highlighting in CodeMirror from an LSP, and enable semantic highlighting, with colors coming from a theme.

A simpler solution to implement would probably be a function that maps a token name (in the same format as returned by a StreamParser) to the CSS classname for that token type, or maybe a mark decoration:

tokenClass : tokenType:string -> string | null

The biggest immediately obvious downside is that it would be impossible to de-register the autogenerated CSS classnames if they are exposed to callers, which might be solved by returning mark decorations, or something else opaque.

marijn · August 14, 2024, 7:44am

Maybe highlightingFor is what you’re looking for. I still don’t recommend reinventing parsing, but if you want to, the system should be modular enough to do this.

ceramichacker · August 14, 2024, 3:16pm

Thanks, I think that’s exactly what I was looking for!