Reimplement markup language in Lezer or not

I have created a markup language for which I have a fully working parser and printer. The language includes notation that leverages indentation, and it is designed to never fail parsing: input that doesn’t match any notation is simply interpreted as text. The parser and printer now fully implement the semantics I had in mind. However, I want editor support for when I use the language in CodeMirror, ProseMirror and VS Code, and for that I also want incremental parsing. That is when I stumbled upon Lezer, which seems to be a great fit in terms of features.

Unfortunately, the markup language’s semantics are not expressible in Lezer’s grammar as far as I am aware: indentation could be externalized like in your Python grammar, but the recovery strategy does not match the semantics of the language. That said, since Lezer is able to parse Markdown by bypassing the standard grammar runtime, surely my simpler language should be supportable too.

Given that the only feature that would really justify dropping my parser and rewriting it in Lezer is incremental parsing, and that I only have the Markdown parser’s source code to base it on, would you suggest doing this, or might I just as well try to implement my own incremental parsing?

I ask this because it seems it would require in-depth knowledge of Lezer’s internals (i.e. a high learning curve), and I saw some comments in the Markdown parser about needing to improve incremental parsing around some of its indentation state. Given that my parser also has some state, I wonder how well incremental parsing would actually perform for my language, or whether I would be better off adding my own simplified incremental parsing that leverages my knowledge of the language semantics.

What to do greatly depends on how expressible your syntax is as an LR grammar. Markdown is absolutely not, but I don’t know how yours works.

If your grammar accepts all inputs, the recovery strategy should be irrelevant—it’ll never be invoked if no syntax errors are possible.

Having a parser that emits Lezer trees is helpful, since that allows CodeMirror’s highlighting/folding/indentation/etc. logic to work with it. If you have an incremental parser already, you could either make it take an option to emit Lezer trees, or do this in a wrapping layer that is somehow clever enough to reuse unchanged subtrees from previous runs.
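For reference, emitting a Lezer tree from an existing parser mostly comes down to producing the flat buffer format that `Tree.build` (in the lezer-tree package) accepts: postorder groups of four numbers per node—type id, start, end, and subtree size in buffer slots. A minimal sketch of that flattening step, with a made-up AST shape (the real node-type ids would come from your `NodeSet`):

```typescript
// Hypothetical AST shape; names are illustrative, not from any real project.
interface AstNode {
  typeId: number       // numeric node-type id, matching a Lezer NodeSet
  from: number
  to: number
  children: AstNode[]
}

// Flatten an AST into Lezer's buffer layout: children precede their parent
// (postorder), each node is four numbers [typeId, from, to, size], where size
// is the number of buffer slots the node's whole subtree occupies.
function toLezerBuffer(node: AstNode, buffer: number[] = []): number[] {
  const start = buffer.length
  for (const child of node.children) toLezerBuffer(child, buffer)
  buffer.push(node.typeId, node.from, node.to, buffer.length + 4 - start)
  return buffer
}
```

In real code you would then hand this array to `Tree.build` along with your node set and top-node id.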

The grammar does indeed accept all inputs (if it doesn’t, that’s considered a bug). I tried to implement it as an LR grammar, to see how far I could get, but could not make it backtrack as needed. Some time ago I also tried implementing the grammar in ANTLR, where I only managed an approximation (it had known edge cases that failed) and the resulting parser was unacceptably slow. My current hand-written recursive descent parser is extremely fast (twice as fast as my limited Lezer implementation, though I am sure that comparison is skewed: I haven’t compared truly large inputs, and memory-wise Lezer should win eventually).

My parser currently produces an AST rather than a CST (concrete syntax tree—why don’t you use that term rather than “non-abstract syntax tree”?). Considering that if I want a CST that keeps track of input ranges and allows cheap copies to support incremental parsing, I might just as well use lezer-tree, I am going to try implementing it in Lezer after all, and am now in the process of figuring out how to map what happens in the Markdown parser to my language.

Am I correct in understanding your code that the logic of incremental parsing is implemented by the Markdown parser itself (given that it lives outside the Lezer runtime and thus lacks that information)? If so, it is good to know that everything I need should be in the Markdown parser, and that it is just a matter of puzzling out how to map its incremental parsing to the semantics of my language (the incremental-parsing semantics of my language I already have).

Yes. What it does is use the fragment information it is given on re-parse to figure out which nodes are safe to reuse. It has to use a WeakMap to attach some extra context information to block nodes (such as their indentation and things like the list marker character used), since re-scanning for that would make incremental parsing too slow.
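The WeakMap technique described here can be sketched roughly like this; the node and context shapes are made up for illustration, but the idea is the same—compute block context once, attach it to the node object, and read it back on a later incremental pass instead of re-scanning the source:

```typescript
// Stand-ins for a tree node and its block context; illustrative only.
interface BlockNode { from: number; to: number }
interface BlockContext { indent: number; marker: string }

// WeakMap so cached context is garbage-collected along with the node.
const contextCache = new WeakMap<BlockNode, BlockContext>()
let scans = 0  // counts how often we actually re-scan the source

function contextFor(node: BlockNode, source: string): BlockContext {
  let ctx = contextCache.get(node)
  if (!ctx) {
    scans++
    // Hypothetical scan: indentation = leading spaces,
    // marker = first non-space character (e.g. a list bullet).
    const line = source.slice(node.from, node.to)
    const indent = line.length - line.trimStart().length
    ctx = { indent, marker: line.trimStart().charAt(0) }
    contextCache.set(node, ctx)
  }
  return ctx
}
```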

My solution to this was just to port the entirety of the Monarch syntax highlighter to CodeMirror and have it output a Lezer tree. The way it does it is… um, kind of hilarious - but for my purposes it was basically my only option. This was because my particular markup language was just a heavily modified Markdown built on Markdown-It, so I didn’t have any “complete” source for the language. I’d have to either write my own parser, mangle the Markdown one, or do something else entirely. I just needed syntax highlighting and some very basic nesting in the Lezer tree, so I made a port that is incremental on a line-by-line basis.

I’m wondering if CM6 should just plain support some form of simplistic syntax highlighting without requiring a parse tree, simply because for some applications making a tree may be difficult or utterly impractical, and mangling a highlighter to export a Lezer tree probably has some really pointless overhead. Maybe a highlighter type that takes in a list of range + tag objects and maybe allows for noting where embedded languages are?

You can do this if you implement a ViewPlugin that returns a list of Decoration.mark(), which are really just <from offset, to offset, props/css class/etc>. I’ve done something similar, drawing inspiration from the source code of the original highlighter plugin; here are some pointers if you want to pursue that.
The plugin itself
The part that creates the decoration
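At its core, the range + tag idea boils down to producing the sorted <from, to, class> triples that Decoration.mark() consumes. A minimal, library-free sketch of that step (the tag-to-class table is made up; in a real plugin you would wrap each triple in `Decoration.mark` and build a `DecorationSet`):

```typescript
// A highlighter token: a source range plus a highlighting tag.
interface Token { from: number; to: number; tag: string }

// Illustrative tag-to-CSS-class mapping.
const tagClasses: Record<string, string> = {
  keyword: "cm-keyword",
  string: "cm-string",
}

// Turn raw tokens into sorted class ranges; decoration sets must be
// sorted by position, and tags without a mapping are simply dropped.
function toDecorationRanges(tokens: Token[]) {
  return tokens
    .filter(t => t.tag in tagClasses)
    .sort((a, b) => a.from - b.from)
    .map(t => ({ from: t.from, to: t.to, class: tagClasses[t.tag] }))
}
```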

@marijn After some experimentation and thinking it over, I ended up choosing to rewrite my parser with Lezer. My original parser was not incremental, i.e. there was no parsing in chunks or reusing said chunks to only reparse what is needed, so that required some work. However, it now correctly produces Lezer trees in chunks for a subset of my grammar, which can be converted to the original AST using the tree cursor.

I managed to understand the Markdown code, except for the hash usage inside the WeakMap, of which I am still a bit unsure. Is my understanding correct that the way the hash has been implemented helps to ensure that the tree can be safely reused inside that context? In other words, the hash is used to verify that the context in which the state was produced is still applicable for the new parse?

I am now working on the most complex parts of the grammar, the parts that require state, and after verifying that it is equivalent to my old parser, I will try to get fragment reuse working.

@Monkatraz If my goal was just highlighting no matter the costs, I already had a (hacky) workaround in my AST parser to keep track of position information as well, so I could have used that to produce highlighting ranges, but I did not want to have to reparse the whole input on every edit. So no matter what I will have to rewrite the parser to be incremental. I thought I could leverage the language semantics to make my own simpler incremental parser compared to Lezer, but if you consider everything, you end up with something very similar, so I chose to just go with Lezer.

@lishid I have yet to look at language support in CodeMirror, but when I do, it is good to know how I would have to approach highlighting. Thanks! Although in my case I can probably just use the highlight plugin as-is, given that I am producing Lezer trees after my rewrite is finished.

Yes, that’s the idea. It encodes indentation and, for blocks like lists, the list character used. Since block parsing is dependent on the current block context, this is necessary to decide whether a given block can be reused.
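In rough terms the scheme looks like this (the hash function below is a made-up stand-in for the real one): each block context hashes its own data together with its parent’s hash, so two hashes only match when the whole chain of enclosing blocks agrees, and a saved node is reused only when its stored hash equals the hash at the reuse site:

```typescript
// Fold a block's data (indentation, list marker) into its parent's hash.
// Simple 31-based rolling hash, purely for illustration.
function contextHash(parentHash: number, indent: number, marker: string): number {
  let h = parentHash
  h = (h << 5) - h + indent
  h = (h << 5) - h + marker.charCodeAt(0)
  return h | 0
}

// A node saved from a previous parse is only safe to reuse when the
// context it was produced in matches the context at the reuse site.
function canReuse(savedHash: number, currentHash: number): boolean {
  return savedHash === currentHash
}
```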

The architecture of lezer-markdown is loosely based on markdown-it, and I have plans to allow configuration in a similar way. Once that’s in place, it is possible that you may be able to extend it to parse your language directly.


Hey, I’ve noticed the changes on the lezer-markdown repo, and what you’re cooking up looks really slick. Thank you for being so quick on this! I guess I won’t need my silly Monarch port… (it’s still extremely useful for making absurd syntax highlighters! regex lookbehind is dangerous)

However, this has made me wonder - do you think that the parse tree that the parser outputs is good enough for HTML rendering? Currently, I have two separate implementations of the same grammar, one in my Monarch port, and one in my markdown-it extensions. It would be very neat to have a consolidated and incrementally rendering Markdown implementation, with extensibility (I legitimately think you’ve incidentally made something very compelling).

Ah cool, I was planning to find this thread and ask you to try the changes, great that you already spotted them.

You could do HTML rendering based on the syntax tree. It’d be a bit awkward—you’d still have to resolve links, make sure you strip out the text covered by markup tokens, and so on. But it might beat maintaining two parsers.

Considering how well-formed the syntax tree Lezer puts out actually is, I’m thinking a decent approach would be a configuration object that tells some sort of ‘tree walker’ what to output into a string whenever it encounters (or leaves) a node. A slick solution might be to build this information into the node-type metadata, so that an extension to syntax could maybe provide the information needed to render it.
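A minimal sketch of such a walker, with an illustrative node shape and config (the real thing would walk a Lezer tree cursor and pull text out of the document rather than store it on the nodes):

```typescript
// Illustrative node shape; leaf nodes carry their source text directly.
interface Node { type: string; text?: string; children: Node[] }

// What to emit when entering and leaving a node of a given type.
type RenderConfig = Record<string, { enter: string; leave: string }>

// Walk the tree, wrapping each configured node's rendered children in its
// enter/leave strings; unconfigured nodes pass their content through.
function render(node: Node, config: RenderConfig): string {
  const tags = config[node.type]
  const inner = node.text ?? node.children.map(c => render(c, config)).join("")
  return tags ? tags.enter + inner + tags.leave : inner
}
```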

I’ll give this a try later - I’ll have to figure out how to actually make this ‘stringifying’ step incremental. (incoming rubber duck debugging) One option is to simply cache the output of already processed nodes (perhaps just block nodes) or to actually adjust the string as the input changes. The former is actually a hacky edit I’ve made to markdown-it already, as I have an incredibly low latency Markdown preview for my editor that needed that optimization.

Finally, I’ll need to do some basic sourcemapping between the Lezer tree and the output. My live preview scroll-syncs in both directions, and it does this by putting a data-line attribute on sourcemapped HTML nodes from markdown-it. I could maybe assemble a full sourcemap using magic-string or something, and offload the work by tossing it at a web worker.
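The data-line step itself is cheap to sketch: map a node’s start offset to a source line number and stamp it onto the emitted tag. Names here are illustrative:

```typescript
// Count newlines before the offset to get a 1-based source line number.
function lineAt(source: string, offset: number): number {
  let line = 1
  for (let i = 0; i < offset && i < source.length; i++)
    if (source.charCodeAt(i) === 10) line++
  return line
}

// Stamp a data-line attribute onto the opening tag of a rendered block,
// so a preview can scroll-sync it back to the editor.
function withDataLine(html: string, source: string, from: number): string {
  return html.replace(/^<(\w+)/, (_: string, tag: string) =>
    `<${tag} data-line="${lineAt(source, from)}"`)
}
```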

If I do get this all working, I’ll definitely release my work and let you know of any friction I encounter along the way. I assume this is going to become more frequent of a task taken up by users, considering the raw power of Lezer.

As for an actual request: some sort of “Inspect Tokens” feature like VS Code has would be really handy for debugging the output of a grammar. Currently, you either have to look at the raw tree through some sort of pretty printer or just visually examine the token colors to debug a grammar.

Trees have a toString method, which usually works well for me if I can keep the input small. If you want to build some kind of interactive tree exploring interface that’d be cool, but I myself don’t expect to work on that.

I shared a pretty printer for this purpose recently: What's the best to test and debug grammars? - #3 by grayen

It is far from perfect, but it has helped me a lot while developing my grammar.

That’s pretty nice. Thanks!

ps: if you want that grammar, it’s using cm6-monarch (made by me self-plug), here ya go:

// cm6-monarch is the package mentioned above; the import path assumes its package name.
import { createMonarchLanguage } from 'cm6-monarch'

const lezerTree = createMonarchLanguage({
  name: 'LezerTree',
  lexer: {
    defaultToken: 'string',
    unicode: true,
    tokenizer: {
      root: [
        // Tree-drawing characters: ├ └ │ ─
        [/[\u251C\u2514\u2502\u2500]/, 'punctuation'],
        // Quoted source text at the end of a line
        [/'.+'$/, 'string'],
        // Node label with its range, e.g. Document [0..12]:
        [/(\w+\s)(\[)(\d+)(\.\.)(\d+)(\]:?)/, [
          'label', 'squareBracket', 'integer', 'operator', 'integer', 'squareBracket'
        ]]
      ]
    }
  }
})