Is it feasible to create a mode based on existing AST?

ruifortes · April 6, 2018, 1:40am

Hello.
I’m parsing some specialized markdown (using unified.js) and wanted to improve the syntax highlighting a bit.
I took a look at the “Writing CodeMirror Modes” topic and the “Overlay Parser Demo” but I’m wondering if I could use the AST created by unified.js instead of the tokenizer approach.
The AST is completely source-mapped so it should be simple to use it for highlighting.
Is this possible?
Thanks

marijn · April 6, 2018, 6:40am

Not really. If you can directly use the tokenizer from your parser, and call it token-by-token, that might be useful for building a mode, but an AST is hard to adapt to the local, incremental way CodeMirror modes run.

ruifortes · April 6, 2018, 8:29am

Is it possible to compute the AST at the beginning and pass it along to each token function?
I know there is a “state” argument but I’m not sure about its use.
Is there any initial event I can catch to parse the AST (and possibly some intermediate line-by-line array storing class info) and make it available downstream?
Is there any disadvantage to using an AST for syntax highlighting?
I guess performance could be an issue but I’m already parsing the AST on text change (with some debounce though…)
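
For illustration, the intermediate line-by-line array mentioned above could look something like the following minimal sketch. It assumes unist-style nodes (`type`, optional `children`, and a `position` with 1-based `line`/`column` and an exclusive `end`). The names `astToLineSpans` and `classFor` are hypothetical and not part of unified.js or CodeMirror.

```javascript
// Hypothetical helper: flatten a source-mapped, unist-style AST into
// per-line style spans with 0-based, end-exclusive character offsets.
function astToLineSpans(root, lines, classFor) {
  const spans = lines.map(() => []);
  (function visit(node) {
    const cls = classFor(node); // caller decides which nodes get a class
    if (cls && node.position) {
      const { start, end } = node.position;
      for (let line = start.line; line <= end.line; line++) {
        const text = lines[line - 1];
        if (text === undefined) break; // position points past the document
        const from = line === start.line ? start.column - 1 : 0;
        const to = line === end.line ? end.column - 1 : text.length;
        if (to > from) spans[line - 1].push({ from, to, cls });
      }
    }
    (node.children || []).forEach(visit);
  })(root);
  return spans;
}

// Usage with a hand-written AST fragment (assumed shape, not real parser output).
const lines = ["Some *emphasis* here"];
const ast = {
  type: "root",
  children: [
    {
      type: "emphasis",
      position: { start: { line: 1, column: 6 }, end: { line: 1, column: 16 } },
    },
  ],
};
const lineSpans = astToLineSpans(ast, lines, (node) =>
  node.type === "emphasis" ? "em" : null
);
```

A structure like this can be rebuilt after each debounced re-parse, which keeps the parsing step separate from whatever applies the styling.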

cben · April 8, 2018, 5:04am

In theory, you could not use a mode at all, instead deleting and creating marks (in a transaction) according to the AST each time you re-parse.
I guess it would feel much slower than normal highlighting.
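
A minimal sketch of that mark-based approach, assuming CodeMirror 5, where the "transaction" would presumably be `cm.operation` and the marks come from `cm.markText`. The function name `applyAstMarks` and the `spansByLine` input (one array per line of `{from, to, cls}` objects with 0-based offsets, derived from the AST elsewhere) are hypothetical.

```javascript
// Marks created by the previous pass, kept so they can be cleared.
let activeMarks = [];

// Hypothetical helper: replace all AST-derived marks in a single operation,
// so the editor only redraws once per re-parse.
function applyAstMarks(cm, spansByLine) {
  cm.operation(() => {
    activeMarks.forEach((mark) => mark.clear());
    activeMarks = [];
    spansByLine.forEach((spans, line) => {
      for (const { from, to, cls } of spans) {
        activeMarks.push(
          // className is applied verbatim (no automatic "cm-" prefix).
          cm.markText({ line, ch: from }, { line, ch: to }, { className: cls })
        );
      }
    });
  });
}
```

This clears and recreates every mark on each pass. Whether that is fast enough for long documents is exactly the open question here, so it would need measuring.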

ruifortes · April 11, 2018, 10:46am

Since the AST is completely source-mapped, changes could be handled without reparsing the whole text.
If changed in an “immutable.js” way, wouldn’t traversing the AST (and possibly a second structure maintaining just line-by-line styling information) and updating the relevant highlighting be quite fast?

cben · April 12, 2018, 9:28am

Updating the highlighting may be fast. My suspicion of performance was more about the parsing itself.
Does your parser have any kind of incremental interface? Say you’re editing a 1000-line doc, and the cursor is around line 201. Can you snapshot parser state after seeing 200 lines, and then feed it various versions of line 201 without re-parsing from the start? Can you feed it just a few lines more instead of all 201-1000? (*)
These are the things that keep CodeMirror modes fast even for very long docs.
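
To make the snapshot idea concrete, here is a minimal sketch of the shape of a CodeMirror 5 mode. It is a toy that only tracks `~~~` fences, not a real markdown mode, and the name `fenceMode` is made up for illustration. The point is that the state is a small plain object that can be copied after any line, so tokenizing can resume from a saved line rather than from the top.

```javascript
// Toy mode object in the CodeMirror 5 shape: startState / copyState / token.
const fenceMode = {
  // State at the start of the document.
  startState() {
    return { inFence: false };
  },
  // Cheap snapshot of the state, e.g. "after seeing 200 lines".
  copyState(state) {
    return { inFence: state.inFence };
  },
  // Called repeatedly on one line; may only consume from the stream it is given.
  token(stream, state) {
    if (stream.sol() && stream.match("~~~")) {
      state.inFence = !state.inFence; // opening or closing fence line
      stream.skipToEnd();
      return "meta";
    }
    stream.skipToEnd();
    return state.inFence ? "comment" : null;
  },
};

// In an editor this would be registered with something like:
// CodeMirror.defineMode("fences", () => fenceMode);
```

An AST parser fits this shape only if its own internal state can be captured and restored in the same way, which is the question being asked above.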

But computers are fast, so perhaps for your use case you’ll get perfectly adequate performance even from full re-parsing… The only way to know is to try.

The CodeMirror mode interface does insist you return token styling for a line immediately when seeing ~~just that single line~~. Even an AST parser with a somewhat incremental interface probably won’t fit well. (That’s what Marijn meant by “hard to adapt”, I think.)
That’s why I suggested above to have no mode, re-parse as I suppose you do now on (debounced) “change”/“changes” events, and then use the marks API to apply the formatting.
Wait, I’m out of date, CodeMirror modes can now use lookAhead to see more than one line while returning styling for one line! So if your parser is “somewhat incremental” perhaps you can wrap it as a mode.

(*) Markdown is mostly lookahead free on block level, so you can fake “parse just a few lines” by parsing a truncated doc, at say number of lines in view + up to end of paragraph.
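
A rough sketch of that truncation trick, assuming the document is available as an array of lines and that a blank line is a good enough marker for "end of paragraph". The helper name `truncateForParse` is hypothetical.

```javascript
// Hypothetical helper: choose the document prefix to hand to the parser.
// Keeps everything up to the last visible line (0-based index), then extends
// to the end of that paragraph, i.e. up to the next blank line or end of doc.
function truncateForParse(lines, lastVisibleLine) {
  let end = Math.min(Math.max(lastVisibleLine, 0), lines.length); // exclusive
  while (end < lines.length && lines[end].trim() !== "") end++;
  return lines.slice(0, end).join("\n");
}

// Usage: the view ends in the middle of the first paragraph.
const docLines = ["# Title", "", "first para", "continues", "", "second para"];
const prefix = truncateForParse(docLines, 2);
```

This is only a heuristic: it still re-parses from the top of the document, it just avoids parsing everything below the viewport.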
P.S. Note also that off-the-shelf markdown parsers are not exactly what you want in an editor while typing is in progress. Consider text with *unterminated inline emphasis. Markdown says that’s not emphasis, just text with a literal *, until you add a closing *. (CodeMirror’s markdown mode highlights it as italic from the start, which is a nicer editing experience.)

P.P.S. If you build highlighting with a “real” parser, please share your results. Especially for markdown, due to the proliferation of variants & extensions, I think many people would be interested in an editor that exactly matches the target processing. At least I am.