Showing off: cm-tarnation, an alternative parser

I’ve created an alternative Textmate-ish-kinda-sorta parser for CodeMirror 6. It’s called cm-tarnation, for no particular reason other than that I’m from Texas.

I’ll paste a bit from its readme:

An alternative parser for CodeMirror 6. Its grammar focuses on being extremely flexible while not suffering the consequence of being utterly impossible to understand. It’s inspired a bit by the Monarch and Textmate grammar formats, but pretty much entirely avoids the pitfalls of their systems.

Tarnation is not line-based. It is capable of reusing both previous and ahead data when parsing, making it fully incremental. It can restart from nearly any point in a document, and usually only barely parses the immediate region around an edit. It also doesn’t use very much memory, due to some clever usage of ArrayBuffer based tokens.
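To give a rough feel for why typed-array tokens save memory, here is a minimal sketch. It is not Tarnation's actual token layout: the `PackedTokens` class, the three-field record (`typeId`, `from`, `to`) and the growth strategy are all assumptions made purely for illustration. The idea is that each token costs a fixed number of bytes in one flat buffer instead of being a separate heap object.

```typescript
// Hypothetical layout (an assumption, not Tarnation's real format):
// every token occupies three consecutive 32-bit slots.
const FIELDS = 3 // [typeId, from, to]

class PackedTokens {
  private data = new Uint32Array(FIELDS * 16)
  length = 0 // number of tokens stored so far

  push(typeId: number, from: number, to: number): void {
    const offset = this.length * FIELDS
    if (offset + FIELDS > this.data.length) {
      // Out of room: copy everything into a buffer twice the size
      const grown = new Uint32Array(this.data.length * 2)
      grown.set(this.data)
      this.data = grown
    }
    this.data[offset] = typeId
    this.data[offset + 1] = from
    this.data[offset + 2] = to
    this.length++
  }

  // Materialize a token object only when someone asks for it
  get(index: number): { typeId: number; from: number; to: number } {
    const offset = index * FIELDS
    return {
      typeId: this.data[offset],
      from: this.data[offset + 1],
      to: this.data[offset + 2]
    }
  }

  // Bytes actually occupied by the stored tokens
  get byteLength(): number {
    return this.length * FIELDS * Uint32Array.BYTES_PER_ELEMENT
  }
}
```

With this layout a token costs 12 bytes regardless of how many there are, and a run of tokens can be copied or reused as a single slice of the buffer.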

Of course, I should say that if you can use Lezer as your language’s parser, you totally should, because it’ll be faster and likely better behaved.

So, what does it look like? You define grammars in a JSON/YAML file. Here is a complex example:

comments:
  block:
    open: '[!--'
    close: '--]'

ignoreCase: true

repository:

  ws: /[^\S\r\n]/
  namela: /_?(?:@ws|@BlockEnd|$)/

  BlockComment:
    match: /(\[!--)([^]+?)(--\])/
    tag: (...) blockComment
    fold: offset(3, -3)
    captures:
      0: { open: BlockComment }
      2: { close: BlockComment }

  BlockStart:
    match: /\[{2}(?![\[/])/
    tag: squareBracket
    closedBy: BlockEnd

  BlockStartClosing:
    match: /\[{2}\//
    tag: squareBracket
    closedBy: BlockEnd

  BlockEnd:
    match: /(?!\]{3})\]{2}/
    tag: squareBracket
    openedBy: [BlockStart, BlockStartClosing]

  BlockNamePrefix:
    match: /[*=><](?![*=><])|f>|f</
    tag: modifier

  BlockNameSuffix:
    match: "_"
    lookbehind: '!/\s/'
    tag: modifier

  BlockLabel:
    match: /[^\s\]]+/
    tag: invalid

  BlockNodeArgument:
    match: /(\S+?)(\s*=\s*)(")((?:[^"]|\\")*)(")/
    captures:
      0: { type: BlockNodeArgumentName, tag: special(propertyName) }
      1: { type: BlockNodeArgumentOperator, tag: definitionOperator }
      2: { open: BlockNodeArgumentMark, tag: string }
      3:
        if: $0
        matches: style
        then: { type: CSSAttributes, nest: style-attribute }
        else: { type: BlockNodeArgumentValue, tag: string }
      4: { close: BlockNodeArgumentMark, tag: string }

  BlockNameMap:
    lookup: $var:blk_map # external variable
    lookahead: /@namela/
    emit: BlockName
    tag: tagName

  BlockNameMapElements:
    # lookup is a list of strings that can be matched
    lookup: $var:blk_map_el # external variable
    lookahead: /@namela/
    emit: BlockName
    tag: tagName

  # blocks

  BlockNodeMap:
    emit: BlockNode
    indent: delimited(]])
    skip: /\s+/
    chain:
      - BlockStart
      - BlockNamePrefix?
      - BlockNameMap
      - BlockNameSuffix?
      - BlockNodeArgument |* BlockLabel
      - BlockEnd

  BlockContainerMap:
    emit: BlockContainer
    fold: inside
    begin:
      type: BlockContainerMapStartNode
      emit: BlockNode
      indent: delimited(]])
      skip: /\s+/
      chain:
        - BlockStart
        - BlockNamePrefix?
        - BlockNameMapElements
        - BlockNameSuffix?
        - BlockNodeArgument |* BlockLabel
        - BlockEnd
    end:
      type: BlockContainerMapEndNode
      emit: BlockNode
      indent: delimited(]])
      skip: /\s+/
      chain:
        - BlockStartClosing
        - BlockNamePrefix?
        - BlockNameMapElements
        - BlockNameSuffix?
        - BlockEnd

includes:
  blocks:
    - BlockNodeMap
    - BlockContainerMap

global:
  - BlockComment

root:
  - include: blocks

Assuming that the external variables are set up correctly, you get this:

I’m scratching the surface of what you can do with this, but overexplaining it would be boring. If you want to see a ginormous grammar made with this, you can take a look at this file.

I’m very proud of this because it’s surprisingly fast and easy to use. It’s also like, my third attempt at getting something like this nice to use? I still have additional plans for it, most of which are just focused on making it behave better in certain scenarios and making defining a large grammar less tedious.


Neat! Thanks for sharing.

Thanks for sharing @Monkatraz

How do you integrate cm-tarnation into a codemirror project? Does it generate a new Language object with an extension that can be added to EditorView?

When you create an instance of the TarnationLanguage class, you can call its load() method to get an ordinary LanguageSupport (this is synchronous). That can be loaded like any other Extension.

Additionally, if you need a LanguageDescription (e.g. for nesting), the description property points to one.

EDIT: Just to show what this looks like:

Small example, with imports:

import { TarnationLanguage } from "cm-tarnation"
import texGrammar from "./tex.yaml"

export const TexLanguage = new TarnationLanguage({
  name: "wikimath",
  grammar: texGrammar as any
})

Big example, no imports:

export const FTMLLanguage = new TarnationLanguage({
  name: "FTML",

  nestLanguages: languageList,

  languageData: {
    autocomplete: completeFTML,
    spellcheck: spellcheckFTML
  },

  supportExtensions: [
    ftmlLinter,
    ftmlHoverTooltips,
    htmlCompletion,
    cssCompletion,
    addLanguages(TexLanguage.description, StyleAttributeGrammar.description)
  ],

  configure: {
    variables: {
      blk_map: blockEntries
        .filter(([, { head, body }]) => head === "map" && body === "none")
        .flatMap(aliasesFiltered),

      blk_val: blockEntries
        .filter(([, { head, body }]) => head === "value" && body === "none")
        .flatMap(aliasesFiltered),

      blk_valmap: blockEntries
        .filter(([, { head, body }]) => head === "value+map" && body === "none")
        .flatMap(aliasesFiltered),

      blk_el: blockEntries
        .filter(([, { head, body }]) => head === "none" && body === "elements")
        .flatMap(aliasesFiltered),

      blk_map_el: blockEntries
        .filter(([, { head, body }]) => head === "map" && body === "elements")
        .flatMap(aliasesFiltered),

      blk_val_el: blockEntries
        .filter(([, { head, body }]) => head === "value" && body === "elements")
        .flatMap(aliasesFiltered),

      // currently empty
      // blk_valmap_el: blockEntries
      //   .filter(([, { head, body }]) => head === "value+map" && body === "elements")
      //   .flatMap(aliasesFiltered),

      mods: moduleEntries.flatMap(aliasesRaw),

      blk_align: ["=", "==", "<", ">"]
    },

    // nesting function so that `[[code type="foo"]]` nests languages
    // @ts-ignore ts doesn't compile the correct type for this, for some reason
    nest(cursor, input) {
      if (cursor.type.name === "BlockNestedCodeInside") {
        // find the starting blocknode
        const startNode = cursor.node.parent?.firstChild
        if (!startNode) return null

        // check its arguments
        for (const arg of startNode.getChildren("BlockNodeArgument")) {
          const nameNode = arg.getChild("BlockNodeArgumentName")
          if (!nameNode) continue
          // check argument name, then check argument value
          if (input.read(nameNode.from, nameNode.to).toLowerCase() === "type") {
            const valueNode = arg.getChild("BlockNodeArgumentValue")
            if (!valueNode) continue
            const value = input.read(valueNode.from, valueNode.to)
            return { name: value }
          }
        }
      }

      return null
    }
  },

  grammar: ftmlGrammar as any
})

The as any is needed because the actual JSON structure of a grammar is strongly typed.

Interesting. Thanks again for all the details.

I’m sorry, but I can’t make it work.

Should I use LanguageSupport?

import {LanguageSupport} from "@codemirror/language"

How do I bind it to an extension?

====

Actually, I’m trying to highlight chat records, where every speaker gets a different color.

I implemented it with Monarch. It didn’t take long and it works well. The code is short:


// Register a new language
monaco.languages.register({ id: 'talk' });

// Register a tokens provider for the language
monaco.languages.setMonarchTokensProvider('talk', {
  tokenizer: {
    root: [
      [/^([^(]+)(\(\d+\))(\s+)(\d{4}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2})$/, [{ token: 'nickname-$1' }, { token: 'userid' }, { token: '' }, { token: 'time', next: '@speak.$1' }]],
    ],
    speak: [
      [/^([^(]+)(\(\d+\))(\s+)(\d{4}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2})$/, { token: '@rematch' }, '@pop'],
      [/.+/, { token: 'message-$S2' }]
    ]
  }
});

monaco.editor.defineTheme('myCoolTheme', {
  base: 'vs',
  inherit: false,
  rules: [
    { token: 'message-Alice', foreground: 'f99252' },
    { token: 'message-Bob', foreground: '008800' },

    { token: 'nickname', foreground: 'ff0000' },
  ],
  colors: {
    'editor.foreground': '#000000'
  }
});

Works well for chat records like this:

Alice(12345) 2022/11/11 23:02:27 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Bob(12346) 2022/11/11 23:02:27 
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY

But when I switched to CodeMirror, Lezer smashed my brain.
I spent three or four times as long as on the Monarch version and it still doesn’t work.
I have used tools like lex/yacc and PEG before, but I can’t understand how Lezer works.

Well, I wrote a stream parser and solved the problem.
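For anyone landing here with the same problem, a stream parser for this chat format could look roughly like the sketch below. It is not the code I ended up with, just an illustration under assumptions: the `LineStream` interface is a hand-written subset of what I understand CodeMirror's `StringStream` to offer (`sol`, `match`, `skipToEnd`), and `SimpleStream`, `styleLines`, and the style names `header` and `message-<nickname>` are made up so the example runs without CodeMirror installed.

```typescript
// Subset of CodeMirror's StringStream used here (assumed to match
// the real class in @codemirror/language).
interface LineStream {
  sol(): boolean
  match(pattern: RegExp): RegExpMatchArray | boolean | null
  skipToEnd(): void
}

interface TalkState {
  speaker: string | null // nickname from the most recent header line
}

// Header line such as "Alice(12345) 2022/11/11 23:02:27"
const HEADER = /^([^(]+)\(\d+\)\s+\d{4}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}\s*$/

const talkParser = {
  startState(): TalkState {
    return { speaker: null }
  },
  // Consumes text from the stream and returns a style name for it
  token(stream: LineStream, state: TalkState): string | null {
    if (stream.sol()) {
      const header = stream.match(HEADER)
      if (header && typeof header !== "boolean") {
        state.speaker = header[1].trim() // remember who is talking
        return "header"
      }
    }
    // Anything else is message text belonging to the current speaker
    stream.skipToEnd()
    return state.speaker ? `message-${state.speaker}` : null
  }
}

// Stand-in stream (hypothetical) so the sketch runs on its own
class SimpleStream implements LineStream {
  pos = 0
  constructor(readonly text: string) {}
  sol() { return this.pos === 0 }
  match(pattern: RegExp) {
    const m = this.text.slice(this.pos).match(pattern)
    if (!m || (m.index ?? 0) > 0) return null
    this.pos += m[0].length
    return m
  }
  skipToEnd() { this.pos = this.text.length }
}

// One style name per line; blank lines are skipped and yield null
function styleLines(doc: string): (string | null)[] {
  const state = talkParser.startState()
  return doc.split("\n").map(line =>
    line === "" ? null : talkParser.token(new SimpleStream(line), state)
  )
}
```

In a real editor the `talkParser` logic would go into the object handed to `StreamLanguage.define` from `@codemirror/language`. The style names returned here are placeholders: as far as I can tell they still have to be mapped to highlight tags (for example through the parser's `tokenTable` option) before each speaker actually gets a color.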