Use Unicode Classes in token definition

Hi! I’m writing a grammar for a language whose variable names and numbers can be written in any script, such as Arabic or Korean. To support this, I tried using Unicode character classes to match letters from any script. The token rule looked like this:

LetterOrUnderscore { $[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}_]+ }

And for numbers:

PositiveNumber { $[\p{Nl}\p{Nd}\p{No}]+ ("." $[\p{Nl}\p{Nd}\p{No}]+)? }

But then Lezer told me this rule had an overlapping character range. So I changed the rules to the following, which should match any letter in any script plus the underscore:

LetterOrUnderscore { $[\p{L}_]+ }
PositiveNumber { $[\p{N}]+ ("." $[\p{N}]+)? }

Lezer compiled the grammar but still didn’t match what I expected, so I wonder whether this syntax is supported in Lezer at all. And if it isn’t, what are the alternatives?

Lezer has no support for Unicode character categories. The `\p{…}` syntax you are trying to use does not exist in its grammar notation.

Since tokens are compiled down to simple state machines whose edges are character ranges, supporting Unicode categories would produce very big parser files because of the large sets of ranges involved. The recommendation is to use a crude, overly permissive range instead, for example treating every code point above 0xc0 that doesn’t serve another purpose in your grammar as a letter.
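To make that concrete, here is a sketch of such a permissive token definition in Lezer grammar notation (the token names and the exact 0xa1 cutoff are illustrative choices, not something Lezer prescribes):

```
@tokens {
  // Crude but compact: ASCII letters, underscore, and every code
  // point from 0xa1 upward count as identifier characters.
  identifierChar { $[a-zA-Z_\u{a1}-\u{10ffff}] }
  Identifier { identifierChar (identifierChar | $[0-9])* }
  Number { $[0-9]+ ("." $[0-9]+)? }
}
```

One tradeoff to be aware of: digits written in non-Latin scripts, which the `\p{Nd}` attempt would have classified as numbers, are lexed as identifier characters under this scheme.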

Ah, that makes sense! Is that why you use the range `$[_\u{a1}-\u{10ffff}]` in the language packages? Thank you for the quick reply!

I’m not sure whether my opinion matters here, but I think Lezer should have Unicode ranges built in. Finding out that it only ships built-in classes for the ASCII character set, like @asciiLetter, felt a bit antiquated. I don’t strictly need non-ASCII support at the moment, but that might change.
