Unicode ranges in regular expressions

Is there a way to match Unicode ranges in regular expressions? For example, in Java I would write [\u00A1-\u1FFF] for a range of single UTF-16 code units and [\\x{10000}-\\x{10FFFF}] for a range of code points that are encoded as UTF-16 surrogate pairs.

What’s the syntax for these ranges in Lezer regular expressions?

Lezer grammars conceptually work with actual Unicode code points, not UTF-16 code units. Just write the actual character ranges you want to match, and the tool will generate the appropriate UTF-16-matching state machine for you.
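To illustrate the code point versus code unit distinction, here is a minimal TypeScript sketch. It is not Lezer code, and the sample character U+1F600 is just an arbitrary pick from outside the Basic Multilingual Plane. It shows that such a character is one code point but two UTF-16 code units, which is the detail the grammar author does not have to care about.

```typescript
// U+1F600 is an arbitrary example of a code point above U+FFFF.
const ch: string = "\u{1F600}";

// JavaScript/TypeScript strings are sequences of UTF-16 code units,
// so a single supplementary code point has length 2.
const codeUnitCount: number = ch.length;

// Iterating a string (here via Array.from) walks it by code point instead.
const codePoints: number[] = Array.from(ch, (c) => c.codePointAt(0)!);

// The two code units form a surrogate pair (high, then low).
const highSurrogate: string = ch.charCodeAt(0).toString(16);
const lowSurrogate: string = ch.charCodeAt(1).toString(16);

console.log(codeUnitCount, codePoints.length, highSurrogate, lowSurrogate);
```

A range written in code points, such as U+10000 to U+10FFFF, therefore stays a single range in the grammar even though the generated matcher has to deal with pairs of code units.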

Yes, that seems to work. It doesn’t look amazing when writing ranges in the more “weird” parts of Unicode, though. This is part of my character set, for example: ¡-῿‐-‛„-‧<U+202A>-⿿、-퟿-𐀀-􏿿. It contains some code points that don’t exist at the moment but fall in ranges that we want to stay forward compatible with. If you think that specifying code points by their hex representation is something you would like and that fits the project, I could submit a PR.

Oh, no, you can absolutely use \u{hex} syntax in ranges. I just wanted to say that you have to express things in Unicode code points, not UTF-16 code units.
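For anyone who wants to avoid pasting hard-to-read literal characters, here is a rough TypeScript sketch of a helper that renders code point ranges with \u{hex} escapes. The helper name `lezerCharSet` is hypothetical, and wrapping the result in `$[...]` assumes Lezer's character-set notation, which is not spelled out in this thread. The two sample ranges are the ones from the Java example in the question, not the full character set quoted above.

```typescript
// A closed range of Unicode code points: [from, to].
type CodePointRange = [number, number];

// Hypothetical helper: renders ranges as a character set using \u{hex} escapes.
// The surrounding $[...] is assumed to be Lezer's character-set syntax.
function lezerCharSet(ranges: CodePointRange[]): string {
  const esc = (cp: number): string => `\\u{${cp.toString(16)}}`;
  const body = ranges
    .map(([from, to]) => (from === to ? esc(from) : `${esc(from)}-${esc(to)}`))
    .join("");
  return `$[${body}]`;
}

// Ranges are given in code points, never in surrogate code units.
const charSet: string = lezerCharSet([
  [0xa1, 0x1fff],
  [0x10000, 0x10ffff],
]);

console.log(charSet);
```

The printed string can then be pasted into a token definition in the grammar file, keeping the ranges readable and independent of editor or font support.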

Ah! I misunderstood you. Thanks!