$[\uFFFF] matches any single code point that uses two code units in UTF-16, possible bug?

I was trying to write a token rule that matched any single character that takes more than one byte to represent in UTF-8.

I expected that the following would work:

MultibyteChar { $[\u0080-\uD7FF] | $[\uE000-\uFFFF] | $[\uD800-\uDBFF] $[\uDC00-\uDFFF] }

It does work. However, I noticed that MultibyteChar { $[\u0080-\uFFFF] } also does what I want, because MultibyteChar { $[\uFFFF] } seems to match each of the two UTF-16 code units of code points above 0xFFFF.

Is this a bug? It seems like this would prevent you from matching any specific code point above 0xFFFF.
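
For concreteness, here is a minimal grammar that reproduces what I'm seeing (the @top wrapper is just something I added for testing, not part of my real grammar):

    @top Program { MultibyteChar* }

    @tokens {
      // Intended: match any single character above U+007F.
      // Observed: this also matches each half of a surrogate pair, so a
      // code point above 0xFFFF comes out as two MultibyteChar tokens.
      MultibyteChar { $[\u0080-\uFFFF] }
    }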

Another example is that MultibyteChar { "𐀀" } does not match 𐀀 (U+10000), but MultibyteChar { $[\uFFFF] } does.

MultibyteChar { "š€€" } seems to match that input just fine when I test it.

But indeed, the parser uses \uffff as a special marker (I think I assumed it wasn't a valid character when I made that choice), which messes things up. I'll take a look at how to fix that.

Also note that grammars are not UTF-16, in that you can't specify surrogate pairs as two separate characters, only as whole Unicode characters.
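
For example (the token name here is made up, just to illustrate the notation), a character like U+10000 is written as a single code point:

    @tokens {
      // one astral code point, written with a \u{...} escape
      Astral { $[\u{10000}] }
      // there is no way to write it as the code unit pair \uD800 \uDC00
    }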

Ok, thanks. I just thought it was worth mentioning. That's a good choice, that the grammars aren't UTF-16. What would be the proper way to match any character above 127?

The patches below should fix the confusion around direct mentions of character 0xffff. Unfortunately, npm is having some kind of issue right now and I can't push new releases. But matching any character above 127 would be done with something like $[\u{80}-\u{10ffff}] even with the current versions of the libraries.
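
In context, that would look something like this (the @top rule is only a sketch for testing):

    @top Program { MultibyteChar* }

    @tokens {
      // any single code point above U+007F, astral characters included
      MultibyteChar { $[\u{80}-\u{10ffff}] }
    }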

That's perfect. Thanks!

Hmmm. I noticed that $[\u{80}-\u{10ffff}] doesn't work with my test cases, but $[\u{80}-\u{10fffe}] does work.

What does your test case look like?