Specifying special tokens in a grammar

nikku · January 10, 2020, 8:02pm

My language allows normal identifiers that are separated by spaces. On the other hand, it also recognizes a few special names that may span multiple words.

The grammar semantics are as follows:

(1) Certain words (e.g. OTHER or FOO BAR) denote Special tokens
(2) All other words made up of ASCII letters, without spaces, denote normal Identifier tokens
(3) A word such as OTHERA is a valid Identifier, not a combination of Special(OTHER) plus an error

I was not able to build rule (3) into a Lezer-based grammar. Is there a way to accomplish such matching behavior? Do I have to write a custom tokenizer for this?

My starting-point grammar is shown below. I have to add @precedence here to make the tokens unambiguous. That, however, means the tokenizer stops as soon as it has matched a special token instead of recognizing the whole word (cf. rule (3) above).

@top[name=Script] {
  AnyName+
}

AnyName {
  Identifier | Special
}

@skip { whitespace }

@tokens {

  whitespace { std.whitespace+ }

  Identifier {
    std.asciiLetter+
  }

  Special {
    "FOO BAR" |
    "OTHER"
  }

  @precedence {
    Special,
    Identifier
  }

}

marijn · January 12, 2020, 10:34am

The built-in tokenizer doesn’t support lookahead, so this is indeed rather tricky to implement. Do the multi-word names allow any whitespace between them, or just a single space?

One option is to write a custom external tokenizer for these, of course.

Or you could treat them as multiple tokens that are parsed as a single element by some other rule. That does require using @extend on the initial words, so that they can be parsed both as identifiers and as the start of a multi-word name. Something like (untested):

FooBar { @extend<identifier, "foo"> @specialize<identifier, "bar"> }

(Or, if only a single space is allowed, wrap the rule in a @skip {} {...} block and put a " " between the tokens.)
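A sketch of that single-space variant, equally untested and reusing the rule and token names from the snippet above:

```
// Nothing is skipped inside this block, so the " " literal has to match
// exactly one space between the two words.
@skip {} {
  FooBar { @extend<identifier, "foo"> " " @specialize<identifier, "bar"> }
}
```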

Finally, if identifiers-separated-by-spaces are otherwise invalid in your language, you could create a multi-identifier token (identifier (" " identifier)*) and use @specialize on that token type.
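A rough, untested sketch of that last option, reusing the token names from the grammar in the question. Note the caveat: with this token, any run of words separated by single spaces is read as one Identifier, so FOOB BAR would come out as a single token rather than two. It only fits a language where that is acceptable.

```
Special {
  // Looked up against the full text of the matched token, spaces included
  @specialize<Identifier, "FOO BAR"> |
  @specialize<Identifier, "OTHER">
}

@tokens {
  whitespace { std.whitespace+ }

  // One or more words, each separated by exactly one space
  Identifier { std.asciiLetter+ (" " std.asciiLetter+)* }
}
```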

nikku · January 12, 2020, 7:20pm

Thanks for the hint to specify spaced names as combinations of @extend<...> @specialize<...>*. This works as expected.

The resulting grammar that incorporates this idea looks like this:

@top[name=Script] {
  AnyName+
}

AnyName {
  Identifier | Special
}

Special {
  fooBar |
  other
}

fooBar { @extend<Identifier, "FOO"> @specialize<Identifier, "BAR"> }

other { @specialize<Identifier, "OTHER"> }

@skip { whitespace }

@tokens {

  whitespace { std.whitespace+ }

  Identifier {
    std.asciiLetter+
  }

}

The grammar correctly recognizes instances of identifiers as well as special names:

OTHERA => Identifier(OTHERA)
OTHER => Special(OTHER)
FOO BAR => Special(FOO BAR)
FOOB BAR => Identifier(FOOB) Identifier(BAR)