Unicode ranges in regular expressions

Cadmium · May 30, 2022, 11:17am

Is there a way to match on Unicode ranges in regular expressions? For example, in Java I would write [\u00A1-\u1FFF] for a range of single UTF-16 code units and [\\x{10000}-\\x{10FFFF}] for a range of code points that are encoded as UTF-16 surrogate pairs.

What’s the syntax for these ranges in Lezer regular expressions?

marijn · May 30, 2022, 12:34pm

Lezer grammars conceptually work with actual Unicode code points, not UTF-16 code units. Just write the actual character ranges you want to match, and the tool will generate the appropriate UTF-16-matching state machine for you.
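
To make the code point versus code unit distinction concrete, here is a minimal TypeScript sketch. It is not part of Lezer; `utf16Units` and `hex` are hypothetical helpers that only show how JavaScript strings store a code point as one or two UTF-16 code units, which is the detail the generated state machine handles for you.

```typescript
// Hypothetical helper (not a Lezer API): list the UTF-16 code units
// that a JavaScript string uses to store a single Unicode code point.
function utf16Units(codePoint: number): number[] {
  const s = String.fromCodePoint(codePoint);
  const units: number[] = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

// Format code units as lowercase hex strings for readability.
const hex = (units: number[]): string[] => units.map(u => u.toString(16));

console.log(hex(utf16Units(0x1fff)));   // one code unit: 1fff
console.log(hex(utf16Units(0x10000)));  // surrogate pair: d800, dc00
console.log(hex(utf16Units(0x10ffff))); // surrogate pair: dbff, dfff
```

Code points up to U+FFFF occupy a single code unit, while anything from U+10000 upward becomes a surrogate pair. A range written in code points can therefore span both cases without the grammar author splitting it by hand.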

Cadmium · June 9, 2022, 7:41am

Yes, that seems to work. It doesn’t look amazing when doing ranges in the more “weird” parts of unicode though, this is part of my character set for example: ¡-῿‐-‛„-‧<U+202A>-⿿、-퟿-𐀀-􏿿. It contains some code points that don’t exist at the moment but are in a range that we want to be future compatible for. If you think that specifying the hex representation of a unicode is something that you would like and fits in the project, I could submit a PR.

marijn · June 9, 2022, 8:07am

Oh, no, you can absolutely use \u{hex} syntax in ranges. I just wanted to say that you have to express things in Unicode code points, not UTF-16 code units.
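
For long character sets, the \u{hex} form can be generated instead of typed by hand. The following TypeScript sketch is one way to do that; `escapeCodePoint` and `rangesToSetBody` are hypothetical names, and the two ranges are the ones from the Java example earlier in the thread. It only builds the text of the ranges, which you would then paste into a character set in your grammar.

```typescript
// A code point range as an inclusive [from, to] pair.
type CodePointRange = [number, number];

// Hypothetical helper: render one code point in \u{hex} form.
function escapeCodePoint(codePoint: number): string {
  return `\\u{${codePoint.toString(16)}}`;
}

// Hypothetical helper: render a list of ranges as the body of a
// character set, using code points rather than UTF-16 code units.
function rangesToSetBody(ranges: CodePointRange[]): string {
  return ranges
    .map(([from, to]) =>
      from === to
        ? escapeCodePoint(from)
        : `${escapeCodePoint(from)}-${escapeCodePoint(to)}`
    )
    .join("");
}

// The two ranges from the Java example in the original question.
const ranges: CodePointRange[] = [
  [0xa1, 0x1fff],
  [0x10000, 0x10ffff],
];

console.log(rangesToSetBody(ranges));
// prints: \u{a1}-\u{1fff}\u{10000}-\u{10ffff}
```

Writing the ranges this way keeps unassigned or invisible code points readable in the grammar source.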

Cadmium · June 9, 2022, 8:11am

Ah! I misunderstood you. Thanks!