How to match special chars in external tokenizers

dhrubomoy · August 27, 2025, 2:51pm

I am writing a Lezer grammar for a language in which all keywords are case-insensitive, and I am following the PHP-style external tokenizer approach for that. The problem is that some keywords contain “#” and others contain “.”. When I run the Lezer generator, I get “Unexpected character “#””.

Is there a way to handle these keywords?

Here is how I defined the external tokenizer:

@external extend {Identifier} keywords from "./tokens" {
    money[@name=function],
    text[@name=function],
    timestamp#[@name=function],
    date#[@name=function],
}

I have changed the rule for Identifier to include some special characters:

@tokens {
    identifierChar { @asciiLetter | $[\u{a1}-\u{10ffff}_#$%@^|?\\.] }
    word { identifierChar (identifierChar | @digit)* }
    Identifier { word }
    //...more
}

In token.js:

export function keywords(name) {
  let found = keywordMap[name.toLowerCase()]
  return found == null ? -1 : found
}

Any help would be appreciated. Thanks.

marijn · August 27, 2025, 9:01pm

You cannot name a Lezer token timestamp#. You’ll just have to use another name for the token. (Note that there’s no constraint that the token name matches the text content of the actual token.)

dhrubomoy · August 27, 2025, 9:04pm

Found a solution: I needed to remove the special characters from the token names in the grammar.

@external extend {Identifier} keywords from "./tokens" {
    timestamp_hash[@name=function] // instead of timestamp#
}

In token.js, I needed to add those keywords to the map, keyed by their actual text:

import * as terms from "./my-parser.terms";

// Keys are the lowercase keyword text; values are the generated term ids.
keywordMap["timestamp#"] = terms.timestamp_hash;

export function keywords(name) {
  let found = keywordMap[name.toLowerCase()]
  return found == null ? -1 : found
}
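For anyone landing here later, the pieces of this thread can be put together roughly as follows. This is only a minimal sketch under some assumptions: the `terms` object stands in for the generated `./my-parser.terms` module (its numeric ids are placeholders), `date_hash` is a hypothetical token name chosen by analogy with `timestamp_hash`, and `export` is left off so the snippet runs on its own.

```javascript
// Stand-in for the generated "./my-parser.terms" module.
// The numeric ids are placeholders, not real generated values.
const terms = { money: 1, text: 2, timestamp_hash: 3, date_hash: 4 }

// Keys are the lowercase source text of each keyword (special characters
// included); values are the term ids of the grammar-safe token names.
// A null-prototype object avoids accidental hits on inherited properties
// such as "constructor".
const keywordMap = Object.assign(Object.create(null), {
  "money": terms.money,
  "text": terms.text,
  "timestamp#": terms.timestamp_hash,
  "date#": terms.date_hash, // hypothetical name, by analogy with timestamp_hash
})

// External specializer: receives the text of an Identifier token and
// returns the matching keyword's term id, or -1 to keep it an Identifier.
function keywords(name) {
  let found = keywordMap[name.toLowerCase()]
  return found == null ? -1 : found
}
```

With this map, `keywords("TIMESTAMP#")` and `keywords("timestamp#")` resolve to the same term, while an ordinary identifier falls through to -1. The null-prototype map is a design choice of this sketch rather than something the thread requires; with a plain `{}` literal, an identifier like `constructor` would find an inherited property instead of returning -1.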