Some of my simple-mode Unicode patterns don't work. Why?

I am adapting a toy language (Monkey) to a dialect (pua-lang) where the keywords are ridiculous Chinese techno-babble. Now what I want to highlight is the following:

赋能 拔河123 = 抓手(x) {
  细分 (x 对齐 0) {
    0;
  } 路径 {
    细分 (x 对齐 1) {
      1;
    } 路径 {
      拔河123(x - 1) 联动 拔河123(x - 2);
    }
  }
};

拔河123(10);

And starting from a working grammar for Monkey, I tried:

CodeMirror.defineSimpleMode('monkey', {
  start: [
    { regex: /".*"/, token: 'string' },
    { regex: /(?:fn|let|return|if|else|抓手|赋能|细分|路径|反哺)(?:\b|(?=\s|[()]))/, token: 'keyword' },
    { regex: /true|false|null|三七五|三二五/, token: 'atom' },
    { regex: /\d+|[-+]?(?:\.\d+|\d+\.?\d*)/, token: 'number' },
    { regex: /[-+\/*=<>!]|对齐|联动|差异|倾斜/, token: 'operator' },
    { regex: /[\{\[\(]/, indent: true },
    { regex: /[\}\]\)]/, dedent: true },
    { regex: /\p{XID_Start}\p{XID_Continue}*|[a-z$][\w$]*/u, token: 'variable' },
  ],
  comment: [],
  meta: {},
});

Now the keyword part looks over-complicated, but that’s just an idiosyncrasy of \b, which only treats ASCII letters, digits, and underscores as word characters. Hardcode a look-ahead, and then it works in both the console and this grammar. What’s really weird is that some things work in the console (as /...regex.../u.exec('string')) but not in the grammar, specifically the operator and variable tokens.
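
To make the \b quirk concrete, here is a small console sketch. The sample strings and the boundaryResults name are only illustrative; the keywords are taken from the grammar above.

```javascript
// \b marks a transition between a \w character and a non-\w character.
// CJK characters are not in \w, so no boundary exists between 能 and a space.
const asciiKeyword = /let\b/;
const cjkKeywordWithB = /赋能\b/;
const cjkKeywordWithLookahead = /赋能(?=\s|[()])/;

const boundaryResults = {
  // "t" is a word character and " " is not, so \b matches.
  asciiBeforeSpace: asciiKeyword.test('let x'),                     // true
  // Neither 能 nor " " is a word character, so \b does not match.
  cjkBeforeSpace: cjkKeywordWithB.test('赋能 拔河'),                 // false
  // 能 followed by an ASCII letter does count as a boundary.
  cjkBeforeAsciiLetter: cjkKeywordWithB.test('赋能x'),              // true
  // An explicit look-ahead sidesteps \b entirely.
  lookaheadBeforeSpace: cjkKeywordWithLookahead.test('赋能 拔河'),   // true
  lookaheadBeforeParen: /抓手(?=\s|[()])/.test('抓手(x)'),          // true
};

console.log(boundaryResults);
```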

What did I mess up here? (pr)

I think I messed up here, and the simple mode code is stripping the u flag from your regexps when it adds a leading ^. Does it work better with this patch?
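
A minimal sketch of that failure mode, for anyone following along. The two helper functions are hypothetical and are not the simple mode's actual source; they only illustrate what happens when a regexp is rebuilt from .source with and without its flags.

```javascript
// Hypothetical helper: anchors a regexp with ^ but forgets its flags.
function anchorLosingFlags(re) {
  return new RegExp('^(?:' + re.source + ')');
}

// Hypothetical helper: anchors a regexp and carries over the i and u flags.
function anchorKeepingFlags(re) {
  let flags = '';
  if (re.ignoreCase) flags += 'i';
  if (re.unicode) flags += 'u';
  return new RegExp('^(?:' + re.source + ')', flags);
}

const variable = /\p{XID_Start}\p{XID_Continue}*/u;

// Without the u flag, \p is just an escaped "p", so the identifier is not matched.
const lost = anchorLosingFlags(variable).exec('拔河123');
// With the u flag, \p{...} is a Unicode property escape again.
const kept = anchorKeepingFlags(variable).exec('拔河123');

console.log(lost);                  // null
console.log(kept && kept[0]);       // "拔河123"
```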

It works! Thanks. A few nitpicks though:

  • Wouldn’t it look more “uniform” to use += "i" for the ignoreCase branch too?
  • Would it make more sense to just take .flags off the original RegExp?

Some of the browsers the library targets don’t have RegExp.flags yet, unfortunately. As for that kind of uniformity, I don’t consider it terribly important.
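
For illustration only, one way to read the flags without depending on RegExp.flags is to rebuild the string from the individual boolean properties. The flagsOf name is hypothetical and this is a sketch, not what the library does.

```javascript
// Hypothetical helper: returns a regexp's flags, using .flags when the
// engine provides it and falling back to the per-flag boolean properties.
function flagsOf(re) {
  if (typeof re.flags === 'string') return re.flags;
  let flags = '';
  if (re.global) flags += 'g';
  if (re.ignoreCase) flags += 'i';
  if (re.multiline) flags += 'm';
  if (re.unicode) flags += 'u';
  if (re.sticky) flags += 'y';
  return flags;
}

// A plain object stands in for a RegExp from an engine without .flags.
const legacyLike = { ignoreCase: true, unicode: true };

console.log(flagsOf(/abc/iu));     // "iu"
console.log(flagsOf(legacyLike));  // "iu"
```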
