Some of my simple-mode Unicode patterns don't work. Why?

Artoria2e5 · April 20, 2021, 7:27am

I am adapting a toy language (Monkey) to a dialect (pua-lang) where the keywords are ridiculous Chinese techno-babble. Now what I want to highlight is the following:

赋能 拔河123 = 抓手(x) {
  细分 (x 对齐 0) {
    0;
  } 路径 {
    细分 (x 对齐 1) {
      1;
    } 路径 {
      拔河123(x - 1) 联动 拔河123(x - 2);
    }
  }
};

拔河123(10);

And starting from a working grammar for Monkey, I tried:

CodeMirror.defineSimpleMode('monkey', {
  start: [
    { regex: /".*"/, token: 'string' },
    { regex: /(?:fn|let|return|if|else|抓手|赋能|细分|路径|反哺)(?:\b|(?=\s|[()]))/, token: 'keyword' },
    { regex: /true|false|null|三七五|三二五/, token: 'atom' },
    { regex: /\d+|[-+]?(?:\.\d+|\d+\.?\d*)/, token: 'number' },
    { regex: /[-+\/*=<>!]|对齐|联动|差异|倾斜/, token: 'operator' },
    { regex: /[\{\[\(]/, indent: true },
    { regex: /[\}\]\)]/, dedent: true },
    { regex: /\p{XID_Start}\p{XID_Continue}*|[a-z$][\w$]*/u, token: 'variable' },
  ],
  comment: [],
  meta: {},
});

Now the keyword part looks over-complicated, but that’s just an idiosyncrasy of \b, which only knows about ASCII word characters. Hardcode a look-ahead, and then it works in both the console and this grammar. What’s really weird is that some things work in the console (as /...regex.../u.exec('string')) but not in the grammar, specifically the operator and variable tokens.
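
For readers wondering about the \b remark: \b is defined in terms of \w, which covers only [A-Za-z0-9_], so there is no word boundary between a Chinese character and a space or parenthesis. A minimal sketch that can be pasted into a console (the names plainBoundary, withLookahead and samples are made up for illustration):

```javascript
// \b only fires between a \w character and a non-\w character.
// Chinese characters are not \w, so "能" followed by " " is not a boundary.
const plainBoundary = /(?:赋能|抓手)\b/;
// Same keywords, with the hardcoded look-ahead from the grammar above.
const withLookahead = /(?:赋能|抓手)(?:\b|(?=\s|[()]))/;

const samples = ['赋能 x', '抓手(x)'];
const results = samples.map(sample => ({
  sample,
  plain: plainBoundary.test(sample),     // false for both samples
  lookahead: withLookahead.test(sample), // true for both samples
}));
console.log(results);
```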

What did I mess up here? (pr)

marijn · April 20, 2021, 8:24am

I think I messed up here, and the simple mode code is stripping the u flag from your regexps when it adds a leading ^. Does it work better with this patch?
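
The effect of a stripped u flag can be reproduced outside the editor. The sketch below is not the library's code; it only assumes that the pattern is rebuilt from .source with a leading ^, as described above. Without u, \p{XID_Start} is no longer a Unicode property escape and the identifier pattern stops matching:

```javascript
const original = /\p{XID_Start}\p{XID_Continue}*/u;

// Rebuilding from .source alone drops every flag, including u.
// In non-unicode mode \p is just the letter "p" and the braces are literal text.
const stripped = new RegExp('^(?:' + original.source + ')');

// Passing "u" explicitly keeps \p{...} as a property escape.
const kept = new RegExp('^(?:' + original.source + ')', 'u');

const strippedMatch = stripped.exec('拔河123');
const keptMatch = kept.exec('拔河123');
console.log(strippedMatch);             // null
console.log(keptMatch && keptMatch[0]); // "拔河123"
```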

Artoria2e5 · April 23, 2021, 2:16am

It works! Thanks. A few nitpicks though:

Wouldn’t it look more “uniform” to use += "i" for the ignoreCase branch too?
Would it make more sense to just take .flags off the original RegExp?

marijn · April 23, 2021, 8:30am

Some of the browsers the library targets don’t have RegExp.flags yet, unfortunately. As for that kind of uniformity, I don’t consider it terribly important.
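
For illustration, here is one way to carry flags over without touching RegExp.flags, by checking the individual boolean properties instead. This is a sketch under that assumption, not the actual patch; anchorAtStart is a hypothetical helper name.

```javascript
// Hypothetical helper: rebuild a regex so it only matches at the start of input,
// copying the flags we care about from boolean properties rather than .flags.
function anchorAtStart(re) {
  let flags = '';
  if (re.ignoreCase) flags += 'i';
  // On engines without the u flag, re.unicode is undefined (falsy),
  // so this branch is simply skipped there.
  if (re.unicode) flags += 'u';
  return new RegExp('^(?:' + re.source + ')', flags);
}

const variable = anchorAtStart(/\p{XID_Start}\p{XID_Continue}*/u);
const keyword = anchorAtStart(/LET|FN/i);

console.log(variable.unicode);                  // true
console.log(variable.exec('拔河123(10)')[0]);    // "拔河123"
console.log(keyword.test('let x'));             // true (ignoreCase kept)
console.log(keyword.test('  let x'));           // false (anchored at start)
```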