Multiple languages in same document based on line numbers

heyman · December 26, 2022, 11:32am

Hi!

I’ve just started experimenting with CodeMirror, and it looks like a very well-written project.

Would it be possible with Lezer to create a language that could provide syntax highlighting for multiple different languages in the same text buffer based on specific line number blocks instead of content in the text? So instead of using something like <script> and </script> to know that a block should be parsed as JavaScript, we can have an external data source that says that lines 1-20 should be parsed as JavaScript, while lines 35-42 should be parsed as CSS (and the rest should be considered plain text).
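To make the question concrete, here is a minimal sketch of what such an external data source could look like. The `LineBlock` shape and the `languageAtLine` helper are hypothetical names made up for illustration; nothing here is part of Lezer or CodeMirror.

```typescript
type Language = "javascript" | "css" | "text";

// Hypothetical shape of the external data source: 1-based, inclusive line ranges.
interface LineBlock {
  from: number; // first line of the block
  to: number;   // last line of the block
  language: Language;
}

// Assumed to be sorted by `from` and non-overlapping.
const blocks: LineBlock[] = [
  { from: 1, to: 20, language: "javascript" },
  { from: 35, to: 42, language: "css" },
];

// Binary search for the block containing `line`.
// Lines not covered by any block are treated as plain text.
function languageAtLine(blocks: readonly LineBlock[], line: number): Language {
  let lo = 0;
  let hi = blocks.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const block = blocks[mid];
    if (line < block.from) hi = mid - 1;
    else if (line > block.to) lo = mid + 1;
    else return block.language;
  }
  return "text";
}
```

With this data, `languageAtLine(blocks, 10)` yields `"javascript"` and `languageAtLine(blocks, 25)` yields `"text"`.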

Could this be achieved with External Tokens?

Best,
Jonatan

marijn · December 26, 2022, 1:33pm

You could probably do something like that with an external tokenizer plus a context to keep track of the current line number, but it’d be somewhat awkward. Where is the information about which language occurs on which line coming from?

heyman · December 26, 2022, 2:03pm

Thanks for the info! I’ll try to describe what I want to do in more detail.

I want to have an editor that is divided into blocks that each span a number of lines (every line should belong to a block). Each block should be able to contain content in a different language (plain text, JavaScript, CSS, etc.). The blocks should also have different background colors. It should be possible to start a new block (for example by pressing Cmd-Enter) and to manually change the language of a block. So I was thinking of tracking the block line numbers in an external data structure that would automatically get updated when the document changes.
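A rough sketch of that bookkeeping, under some simplifying assumptions: each block is stored only by its start line (it runs up to the next block's start), and an edit is described at line granularity by a hypothetical `LineEdit` record. A real editor reports changes as character offsets, so this only illustrates the idea.

```typescript
type Language = "text" | "javascript" | "css";

// Each block starts at `startLine` (1-based) and runs up to the next block's start.
interface Block {
  startLine: number;
  language: Language;
}

// Hypothetical line-level edit: the `removed` lines starting at `line`
// were replaced by `inserted` new lines.
interface LineEdit {
  line: number;
  removed: number;
  inserted: number;
}

// Map a block's start line through an edit.
function mapLine(start: number, edit: LineEdit): number {
  // Starts at or before the edit are unaffected.
  if (start <= edit.line) return start;
  // Starts after the removed range shift by the net change in line count.
  if (start >= edit.line + edit.removed) return start + edit.inserted - edit.removed;
  // The start line itself was removed: clamp to the first line after the replacement.
  return edit.line + edit.inserted;
}

function updateBlocks(blocks: readonly Block[], edit: LineEdit): Block[] {
  const mapped = blocks.map(b => ({ ...b, startLine: mapLine(b.startLine, edit) }));
  // A block whose start now coincides with the next block's start has become empty.
  // (Blocks pushed past the end of the document are not handled in this sketch.)
  return mapped.filter(
    (b, i) => i === mapped.length - 1 || b.startLine < mapped[i + 1].startLine
  );
}
```

Inserting lines inside a block grows that block and shifts every later block down; deleting a whole block removes its entry, and the following block takes over its start line.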

Another approach I’ve thought about would be to keep block separators in the buffer content instead, using some obscure Unicode characters to denote the start of the different kinds of blocks (text, JavaScript, CSS, etc.). The separator character would then be rendered as a newline, and I’d have to manually replace the block separators with \n when copying and pasting text. Do you think this sounds like a better approach?
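A minimal sketch of the separator idea, assuming one private-use code point per language (the specific characters U+E000 to U+E002 and the helper names are arbitrary choices for illustration):

```typescript
type Language = "text" | "javascript" | "css";

// Assumed mapping from separator character to the language of the block it starts.
const SEPARATORS: { [ch: string]: Language | undefined } = {
  "\uE000": "text",
  "\uE001": "javascript",
  "\uE002": "css",
};

interface BlockRange {
  language: Language;
  from: number; // offset of the first content character
  to: number;   // offset just past the last content character
}

// Split the buffer into block ranges. Content before the first separator is
// treated as plain text, and empty blocks are skipped.
function splitBlocks(doc: string): BlockRange[] {
  const ranges: BlockRange[] = [];
  let language: Language = "text";
  let from = 0;
  for (let i = 0; i < doc.length; i++) {
    const next = SEPARATORS[doc.charAt(i)];
    if (next === undefined) continue;
    if (i > from) ranges.push({ language, from, to: i });
    language = next;
    from = i + 1; // content starts after the separator character
  }
  if (doc.length > from) ranges.push({ language, from, to: doc.length });
  return ranges;
}

// What copying to the clipboard would need to do: turn separators into newlines.
function toPlainText(doc: string): string {
  return doc.replace(/[\uE000-\uE002]/g, "\n");
}
```

One trade-off of this variant is that the block structure travels with the document text, so undo history and copy/paste keep it consistent without a separate data structure to update.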

marijn · December 31, 2022, 2:12pm

Another thing you could try is to have your block-structure metadata keep a flat syntax tree for the extent of the blocks (reusing nodes for unchanged blocks so that incremental parsing works), and make the top-level parser a kind of pseudo-parser that just returns this tree, using parseMixed to run block-specific parsers for each part.
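The node-reuse part of this suggestion could be sketched roughly as follows. Plain objects stand in for syntax-tree nodes here; `BlockNode`, `FlatTree`, and `buildFlatTree` are hypothetical names, not Lezer's API, and the sketch assumes blocks are joined by a single separator character.

```typescript
interface BlockSpec {
  language: string;
  text: string;
}

// Stand-in for a tree node covering one block.
interface BlockNode {
  readonly language: string;
  readonly text: string;
}

interface FlatTree {
  readonly children: BlockNode[];
  readonly positions: number[]; // start offset of each child in the document
  readonly length: number;
}

// Build a flat tree for the current blocks, reusing node objects from the
// previous tree whenever a block's language and content are unchanged.
function buildFlatTree(specs: readonly BlockSpec[], previous?: FlatTree): FlatTree {
  const keyOf = (language: string, text: string) => language + "\u0000" + text;

  // Pool of reusable nodes from the previous tree, keyed by language + content.
  const pool: { [key: string]: BlockNode[] | undefined } = {};
  for (const node of previous ? previous.children : []) {
    const key = keyOf(node.language, node.text);
    let list = pool[key];
    if (!list) {
      list = [];
      pool[key] = list;
    }
    list.push(node);
  }

  const children: BlockNode[] = [];
  const positions: number[] = [];
  let offset = 0;
  for (const spec of specs) {
    const candidates = pool[keyOf(spec.language, spec.text)];
    const reused = candidates ? candidates.shift() : undefined;
    children.push(reused || { language: spec.language, text: spec.text });
    positions.push(offset);
    offset += spec.text.length + 1; // +1 for the assumed separator character
  }
  // No separator follows the last block.
  const length = specs.length > 0 ? offset - 1 : 0;
  return { children, positions, length };
}
```

Because unchanged blocks keep the same node object across rebuilds, whatever consumes the tree can detect by identity which blocks need no re-parsing, which is the property incremental parsing relies on.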

heyman · January 2, 2023, 10:14am

Yes, this is what I ended up doing (I think), and it seems to be working well so far. I’ve made a top-level custom parser with a syntax for declaring blocks and their language, and I use parseMixed to set the language for each block. I then replace the tokens that specify a block and its language with a widget that hides the string.

Thanks a lot for your help!