$[\uFFFF] matches any single code point that uses two code units in UTF-16, possible bug?

I was trying to write a token rule that matched any single character that takes more than one byte to represent in UTF-8.

I expected that the following would work:

MultibyteChar { $[\u0080-\uD7FF] | $[\uE000-\uFFFF] | $[\uD800-\uDBFF] $[\uDC00-\uDFFF] }

It does work. However, I noticed that MultibyteChar { $[\u0080-\uFFFF] } also does what I want, because MultibyteChar { $[\uFFFF] } seems to match each of the two UTF-16 code units of code points above 0xFFFF.

Is this a bug? It seems like this would prevent you from matching any specific code point above 0xFFFF.
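
For concreteness, here is a minimal grammar that reproduces what I'm seeing (the @top wrapper is just something I added for testing, not part of my real grammar):

    @top Program { MultibyteChar* }

    @tokens {
      // Intended: match any single character above U+007F.
      // Observed: this also matches each half of a surrogate pair, so a
      // code point above 0xFFFF comes out as two MultibyteChar tokens.
      MultibyteChar { $[\u0080-\uFFFF] }
    }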

Another example is that MultibyteChar { "𐀀" } does not match 𐀀 (U+10000), but MultibyteChar { $[\uFFFF] } does.

MultibyteChar { "š€€" } seems to match that input just fine when I test it.

But indeed, the parser uses \uffff as a special marker (I think I assumed it wasn't a valid character when I made that choice), which messes things up. I'll take a look at how to fix that.

Also note that grammars are not UTF-16, in that you can't specify surrogate pairs as two separate characters, only as whole Unicode characters.
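
For example (the token name here is made up, just to illustrate the notation), a character like U+10000 is written as a single code point:

    @tokens {
      // one astral code point, written with a \u{...} escape
      Astral { $[\u{10000}] }
      // there is no way to write it as the code unit pair \uD800 \uDC00
    }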

Ok, thanks. I just thought it was worth mentioning. That's a good choice, that the grammars aren't UTF-16. What would be the proper way to match any character above 127?

The patches below should fix the confusion around direct mentions of character 0xffff. Unfortunately, npm is having some kind of issue right now and I can't push new releases. But matching any character above 127 would be done with something like $[\u{80}-\u{10ffff}] even with the current versions of the libraries.
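
In context, that would look something like this (the @top rule is only a sketch for testing):

    @top Program { MultibyteChar* }

    @tokens {
      // any single code point above U+007F, astral characters included
      MultibyteChar { $[\u{80}-\u{10ffff}] }
    }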

That's perfect. Thanks!

Hmmm. I noticed that $[\u{80}-\u{10ffff}] doesn't work with my test cases, but $[\u{80}-\u{10fffe}] does work.

What does your test case look like?