My language allows normal identifiers separated by spaces, but it also recognizes a few special names that may span multiple words.
The grammar semantics are as follows:
(1) Certain words (e.g. OTHER or FOO BAR) denote Special tokens
(2) All other ASCII words without spaces denote normal Identifier tokens
(3) A word such as OTHERA is a valid Identifier, not a Special(OTHER) token followed by an error
I was not able to build rule (3) into a Lezer-based grammar. Is there a way to accomplish such matching behavior? Do I have to write a custom tokenizer to accomplish this?
My basic starting-point grammar is shown below. I had to add @precedence to make the tokens unambiguous. However, that leads to the tokenizer not recognizing the whole word once it has found a special token (cf. rule (3) above).
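The grammar snippet itself did not make it into the post; a minimal sketch of such a starting point (untested, with assumed rule names) might look like:

```
@top Program { (Special | Identifier)* }

@skip { space }

@tokens {
  // Declaring precedence resolves the Special/Identifier overlap,
  // but it also makes the tokenizer stop after "OTHER" even in
  // "OTHERA" -- the symptom described above for rule (3).
  @precedence { Special, Identifier }
  Special { "OTHER" | "FOO BAR" }
  Identifier { ![ \t\n]+ }
  space { $[ \t\n]+ }
}
```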
The built-in tokenizer doesn’t support lookahead, so this is indeed rather tricky to implement. Do the multi-word names allow any whitespace between them, or just a single space?
One option is to write a custom external tokenizer for these, of course.
Or you could treat them as multiple tokens that are parsed as a single element by some other rule. That does require using @extend on the initial words, so that they can be parsed both as identifiers and as the start of a multi-word name. Something like (untested):
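A sketch of that approach (untested, with hypothetical rule names) could be:

```
@top Program { (Special | Identifier)* }

// A Special name is either the extended tokens FOO and BAR in
// sequence, or the single extended token OTHER.
Special { kw<"FOO"> kw<"BAR"> | kw<"OTHER"> }

// @extend (rather than @specialize) lets these words still be
// parsed as plain Identifiers elsewhere, so OTHERA remains a
// single Identifier instead of Special(OTHER) plus an error.
kw<term> { @extend[@name={term}]<Identifier, term> }

@skip { space }

@tokens {
  Identifier { ![ \t\n]+ }
  space { $[ \t\n]+ }
}
```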
(Or, if only a single space is allowed, wrap the rule in a @skip {} {...} block and put a " " between the tokens.)
Finally, if identifiers-separated-by-spaces are otherwise invalid in your language, you could create a multi-identifier token (identifier (" " identifier)*) and use @specialize on that token type.
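A sketch of that last approach (again untested, with hypothetical names), where a multi-word string becomes a single token that @specialize can match against:

```
@top Program { (Special | Identifier)* }

// This only works if two plain identifiers never appear next to
// each other, since the Identifier token may now greedily span
// space-separated words.
Special {
  @specialize<Identifier, "FOO BAR"> |
  @specialize<Identifier, "OTHER">
}

@skip { space }

@tokens {
  word { ![ \t\n]+ }
  Identifier { word (" " word)* }
  space { $[ \t\n]+ }
}
```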