Tag Granularity for themes and comparisons to TextMate Grammars

I have seen comments in a few places (quoted below) about the conventions for tags / scopes used in parsing and syntax highlighting. They aren’t clear to me, so I wanted to clarify what is meant and what CodeMirror’s goal is with the approach it takes.

TextMate-style open ended type names mean that everybody is going to define their own ad-hoc types, and theme writers just won’t know what to target. So you get impractically huge themes trying to target all the crap found in the wild

https://lezer.codemirror.net/docs/ref/#highlight.Tag

CodeMirror uses a mostly closed vocabulary of syntax tags (as opposed to traditional open string-based systems, which make it hard for highlighting themes to cover all the tokens produced by the various languages).

My understanding of TextMate grammars is that the problems implied above are theoretically possible but don’t happen often in practice, even though these grammars are used across most modern code editors.

The fallback system in TextMate grammars / scopes works such that if a grammar assigns the scope markup.list.numbered.markdown and no highlighting rule is found for it, the lookup falls back to markup.list.numbered, then markup.list, and finally markup. So the top-level root scope names are the only things that theme authors need to target. Writing themes with huge numbers of scopes is something that theme authors would do if and only if they want to change specific things.
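To make the fallback described above concrete, here is a minimal sketch of it. It is a simplification for illustration only, not TextMate's actual selector matching; the Theme type, the resolveScope function, and the colors are all hypothetical names invented for this example.

```typescript
// Hypothetical theme shape: a flat map from scope name to color.
type Theme = Record<string, string>;

// Look up a scope; if no rule exists, drop the last dot-separated
// segment and try again, until a rule matches or nothing is left.
function resolveScope(scope: string, theme: Theme): string | undefined {
  let current = scope;
  while (current.length > 0) {
    if (Object.prototype.hasOwnProperty.call(theme, current)) {
      return theme[current];
    }
    const lastDot = current.lastIndexOf(".");
    if (lastDot < 0) return undefined; // root scope had no rule either
    current = current.slice(0, lastDot);
  }
  return undefined;
}

// A theme that only targets root-level and one second-level scope.
const theme: Theme = {
  markup: "#555555",
  "markup.list": "#0066cc",
};

// Falls back: markup.list.numbered.markdown -> markup.list.numbered -> markup.list
const color = resolveScope("markup.list.numbered.markdown", theme);
```

With this theme, the numbered-list scope ends up with the markup.list color, and a scope such as markup.heading.markdown would end up with the plain markup color.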

The more restrictive approach taken in CodeMirror makes it difficult to build themes with the same degree of control that is available in other editors, and seems to require one of the following options:

  1. Modifying the parser in order to change the opinions enforced by the mode author about how tokens should be grouped into a very limited number of available scopes
  2. Writing every mode with a huge number of exported custom tags, where each custom tag falls back to one of the standard tags. This basically re-creates TextMate’s approach, but via imports rather than string construction, and I’m not sure why that would be worth it.

Am I missing a dimension of this?

The themes I looked at had all kinds of language-specific rules in them, which seems contrary to the idea of a generic theme.

The vocabulary provided by @lezer/highlight is not, I think, “very limited”. It has 78 different tags and 6 modifiers to work with. Typical programming languages should be able to tag all the constructs they distinguish with these with little problem, and exporting custom tags is not something that happens a lot.

What, concretely, is the problem you are having here?

I think the reason why language-specific rules are included is important. Theme authors choose to provide highlighting rules for specific modes, because they want to make a really nice theme that they have fine-grained control over, not because they need to in order to make a decent theme.

The concrete issue I’m having is exactly that, particularly for markdown (which I know is a special case). I’m making a theme and want to have page structure marks (Section header marks, list marks) colored separately from text emphasis marks.
The Markdown mode for Lezer assigns the same tag to all of these (see markdown.ts in the lezer-parser/markdown repository on GitHub).

As far as I can tell, my only option is to modify the markdown mode (with custom tags for these things that are exported, etc.) in order to make a theme behave the way I want. Is that a good general way to go about making this type of finer-grained control possible for theme authors?

Those marks are distinct in the syntax tree, so you can extend the existing lang-markdown package to add additional tags for them without forking that or the parser.

And those additional tags would be custom tags that fall back to processingInstruction, which theme authors would have to import into their themes in order to target them?

It seems like I would still have to fork the lang-markdown package to add that, no?

If they are local to your system, you don’t have to fork lang-markdown; you can use a reconfigured form of the Markdown parser by passing extensions to markdown().
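A rough sketch of what that could look like, assuming the @codemirror/lang-markdown, @codemirror/language, and @lezer/highlight packages are installed. The tag names structureMark and emphasisMark, the extension name markTags, and the colors are made up for this example. HeaderMark, ListMark, and EmphasisMark are node names from the Markdown parser's syntax tree. I have not verified this against every package version, so treat it as a starting point rather than the definitive setup.

```typescript
import {markdown} from "@codemirror/lang-markdown"
import {HighlightStyle, syntaxHighlighting} from "@codemirror/language"
import {Tag, styleTags, tags} from "@lezer/highlight"

// Hypothetical custom tags. Each one falls back to processingInstruction,
// so themes that don't know about them still style these marks as before.
export const structureMark = Tag.define(tags.processingInstruction)
export const emphasisMark = Tag.define(tags.processingInstruction)

// A Markdown parser extension that only adds highlighting props;
// it defines no new syntax.
const markTags = {
  props: [
    styleTags({
      "HeaderMark ListMark": structureMark,
      EmphasisMark: emphasisMark
    })
  ]
}

// A highlight style that colors the two groups of marks differently
// (colors are arbitrary placeholders).
const markHighlight = HighlightStyle.define([
  {tag: structureMark, color: "#888888"},
  {tag: emphasisMark, color: "#cc6600"}
])

// Extensions to add to the editor configuration.
export const markdownSupport = [
  markdown({extensions: markTags}),
  syntaxHighlighting(markHighlight)
]
```

Because the custom tags are defined with a parent tag, a theme has to import structureMark and emphasisMark only if it wants to distinguish them. Any other theme keeps treating both as processingInstruction.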