Match a token at the beginning of a line in a language mode

Hi there,

I am writing a language mode for CodeMirror. Writing a Lezer parser seemed complex, so I settled on the simpleMode function from @codemirror/legacy-modes. Now I want to match a token that only appears at the start of a line.

I have the following language mode:

import { simpleMode } from '@codemirror/legacy-modes/mode/simple-mode';

export const myLang = simpleMode({
  start: [
    {
      regex: /\b[0-9]+\b/,
      token: 'number',
    },
    {
      regex: /^\w+:/,
      token: 'keyword',
    },
    {
      regex: /;[^\n]+/,
      token: 'comment',
    },
  ],
  languageData: {
    name: 'myLang',
    commentTokens: { line: ';' },
  },
});

However, if I give an input to the editor such as “foo: bar: baz: qux”, all three tokens “foo:”, “bar:” and “baz:” are recognized as keywords, instead of only “foo:”. I suspect this is because the regex is applied incrementally to the rest of the string.

A Stackblitz is available here: Webpack.js Getting Started Example - StackBlitz, if you want to try it out.

How can I modify the mode so that keyword tokens are only recognized at the beginning of a line?

Thank you.

You can put sol: true in the token object to only match at start of line.
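For anyone landing here later, a minimal sketch of that suggestion, using the rules from the mode above. The keyword rule gets sol: true, and the ^ is dropped because it is no longer needed. The tokenizeLine helper is hypothetical: it is a simplified stand-in for how a simple mode walks a line, not the library's code, and is only there so the snippet runs without an editor. In a real mode you would pass the rules array as the start state to simpleMode.

```javascript
// Rules in the shape used for the `start` state in the question.
// `sol: true` restricts the keyword rule to the start of a line.
const rules = [
  { regex: /\b[0-9]+\b/, token: 'number' },
  { regex: /\w+:/, token: 'keyword', sol: true },
  { regex: /;[^\n]+/, token: 'comment' },
];

// Hypothetical, simplified stand-in for the tokenizer loop (an assumption for
// illustration): each rule is tried against the rest of the line, anchored at
// the current position, and rules marked `sol` are skipped unless pos is 0.
function tokenizeLine(line, rules) {
  const tokens = [];
  let pos = 0;
  while (pos < line.length) {
    const rest = line.slice(pos);
    let matched = false;
    for (const rule of rules) {
      if (rule.sol && pos !== 0) continue; // only allowed at start of line
      const m = new RegExp('^(?:' + rule.regex.source + ')').exec(rest);
      if (m && m[0].length > 0) {
        tokens.push({ text: m[0], token: rule.token });
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) pos += 1; // no rule applies: skip one character unstyled
  }
  return tokens;
}

// Only "foo:" is reported as a keyword; "bar:" and "baz:" are left unstyled.
console.log(tokenizeLine('foo: bar: baz: qux', rules));
```

Without sol: true on the keyword rule, the same loop would also report “bar:” and “baz:”, which matches the behaviour described in the question.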

@marijn Thanks for your answer. I have a follow-up question on language modes.

I’m trying to modify my language mode to recognize keywords like “if” and “else”. I have the following language mode:

  start: [
    {
      regex: /\b[0-9]+\b/,
      token: 'number',
    },
    {
      regex: /\b(if|else|for|let)\b/,
      token: 'keyword',
    },
    {
      regex: /;[^\n]+/,
      token: 'comment',
    },
  ],
  languageData: {
    name: 'myLang',
    commentTokens: { line: ';' },
  },

Even with the word boundaries on the keyword regex, I find that it matches occurrences inside longer words, such as “let” within “alet”.

Here’s a Stackblitz demonstrating my issue, if you want to give it a go.

Do you have any suggestions as to how I can match a word without matching substring occurrences? Thanks again.

Add a token for identifiers, so that letters aren’t being matched one at a time. These regexps are run against a sliced string starting at the start of the token, so the \b at its start will do absolutely nothing.
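A rough sketch of what that could look like. The first lines show why the leading \b has no effect once the string is sliced; the rules array then adds an identifier rule after the keyword rule. The identifier pattern and the 'variable' token name are assumptions about the language, not something given in the thread, and tokenizeLine is a hypothetical, simplified stand-in for the tokenizer so the snippet runs on its own.

```javascript
// The regexp only sees the rest of the line. Once the tokenizer has stepped
// past "a", the slice starts with "let", and \b matches at the slice start.
const rest = 'alet'.slice(1);
console.log(/^(?:\b(if|else|for|let)\b)/.test(rest)); // true

// Rules in the shape of the `start` state from the question. Leading \b is
// dropped because it does nothing; the trailing \b still matters.
const rules = [
  { regex: /[0-9]+\b/, token: 'number' },
  { regex: /(?:if|else|for|let)\b/, token: 'keyword' },
  // Assumed identifier syntax: consumes a whole word in one step, so the
  // keyword rule is never tried in the middle of a word.
  { regex: /[A-Za-z_]\w*/, token: 'variable' },
  { regex: /;[^\n]+/, token: 'comment' },
];

// Hypothetical, simplified stand-in for the tokenizer loop (an assumption for
// illustration): try each rule against the rest of the line, anchored at pos.
function tokenizeLine(line, rules) {
  const tokens = [];
  let pos = 0;
  while (pos < line.length) {
    const sliced = line.slice(pos);
    let matched = false;
    for (const rule of rules) {
      const m = new RegExp('^(?:' + rule.regex.source + ')').exec(sliced);
      if (m && m[0].length > 0) {
        tokens.push({ text: m[0], token: rule.token });
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) pos += 1; // no rule applies: skip one character unstyled
  }
  return tokens;
}

// "let" is a keyword, "alet" is a single identifier token, "10" is a number.
console.log(tokenizeLine('let alet = 10', rules));
```

The keyword rule has to come before the identifier rule, since the first matching rule wins; the trailing \b keeps words like “letter” from being split into a keyword plus a remainder.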