External tokenizer getting invoked too late

I’m trying to match a string using an external tokenizer, but the function is invoked too late. The Identifier token attempts to match first, which is expected behavior; the external tokenizer, however, only gets its chance later in the parse than I need.

Here’s the relevant part of my grammar:

TableLabel { tableName ":" }
LoadStatement { TableLabel? load String }
VariableName { Identifier }
SetStatement { set VariableName "=" (String | Number) }
// ... more
@tokens {
	identifierChar { @asciiLetter | $[\u{a1}-\u{10ffff}_#%@^|?\\.] }
	word { identifierChar (identifierChar | @digit)* }
	Identifier { word } 
	// ... more
}
@external tokens tableName from "./tokens" { tableName }

In tokens.ts:

import { ExternalTokenizer } from "@lezer/lr";
// Generated term ids; the path may differ in your setup
import * as terms from "./parser.terms";

const COLON = ":".charCodeAt(0); // 58

// Match anything until it reaches a colon
export const tableName = new ExternalTokenizer(
  (input) => {
    let { next } = input;
    let hasContent = false;
    // Scan forward until end of input (-1) or a colon
    while (next !== -1 && next !== COLON) {
      hasContent = true;
      input.advance();
      next = input.next;
    }
    // Emit a tableName token if anything was consumed
    if (hasContent) {
      input.acceptToken(terms.tableName);
    }
  },
  { contextual: true, fallback: true },
);

When I input something like:

tablename:
Load "somefile.csv";

The issue is that the tableName function only gets invoked when the next token is “:”, instead of at the beginning of the string. By then “tablename” has already been recognized as a VariableName, and I get a syntax error.
I tried declaring the tableName external tokenizer before the @tokens block and adding extend: true to the tokenizer options, but that significantly slows down parsing.

Any suggestions on how to fix this problem?

Put the @external tokens declarations before your @tokens block to give it a higher precedence.
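
For reference, that would mean ordering the declarations like this (a sketch using the grammar from the question; in Lezer, tokenizers declared earlier in the grammar file take precedence):

@external tokens tableName from "./tokens" { tableName }

@tokens {
	identifierChar { @asciiLetter | $[\u{a1}-\u{10ffff}_#%@^|?\\.] }
	word { identifierChar (identifierChar | @digit)* }
	Identifier { word }
	// ... more
}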

Thanks for the reply. I tried that as well; on its own it didn’t work. When I added {extend: true} it worked, but then I get performance issues with the parser, and when there is a syntax error somewhere in the input, the syntax highlighting breaks for the rest of the input.
Is there anything wrong I am doing in the tokenizer function that you can spot?

Well, yes. Your loop blindly scans forward to the next colon and always accepts a token if there’s any content before it. You’ll probably want to check that you only read identifier characters.

“You’ll probably want to check that you only read identifier characters.”

How do I implement that check?
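
For illustration, one way to implement that check (a minimal sketch, not code from the thread: isIdentifierChar is a hypothetical helper mirroring the identifierChar and @digit sets from the grammar, and the trailing-colon check is an extra guard beyond what was suggested above):

import { ExternalTokenizer } from "@lezer/lr";
// Generated term ids; the path may differ in your setup
import * as terms from "./parser.terms";

const COLON = ":".charCodeAt(0);

// Hypothetical helper mirroring identifierChar/@digit from the grammar
function isIdentifierChar(code: number): boolean {
  return (
    (code >= 48 && code <= 57) ||  // @digit: 0-9
    (code >= 65 && code <= 90) ||  // @asciiLetter: A-Z
    (code >= 97 && code <= 122) || // @asciiLetter: a-z
    code >= 0xa1 ||                // $[\u{a1}-\u{10ffff}]
    "_#%@^|?\\.".indexOf(String.fromCharCode(code)) >= 0
  );
}

export const tableName = new ExternalTokenizer(
  (input) => {
    let length = 0;
    // Only consume identifier characters; stop on anything else
    while (isIdentifierChar(input.next)) {
      input.advance();
      length++;
    }
    // Accept only if something was read and a colon follows
    if (length > 0 && input.next === COLON) {
      input.acceptToken(terms.tableName);
    }
  },
  { contextual: true, fallback: true },
);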

The problem is that the external tokenizer needs to match special characters as well (for example, “tableName$(var1)_table”). I tried updating the Identifier token to include special chars, but that creates other issues.

Is an external tokenizer the right approach for this problem? Are there alternative approaches that might work better?

Doing this on the grammar level, with a TableName node that wraps whatever expressions can occur in a table name, seems like a better approach. You’ll just have to be careful to shape the grammar in such a way that the parser performs the same actions on table-name and non-table-name expressions up to the point where it sees the colon, to avoid LR conflicts.
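
To illustrate the shape (a hedged sketch, not a drop-in grammar; nameExpr is a hypothetical placeholder for whatever may occur in a table name, such as identifiers and $(var) expansions):

// TableName includes the ":" and reuses the same expression rule as
// other contexts, so the parser can take the same actions either way
// until the colon actually appears in the lookahead.
TableName { nameExpr ":" }
LoadStatement { TableName? load String }

// Hypothetical placeholder rule for table-name expressions
nameExpr { Identifier }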