Heap out-of-memory error when adding support for case-insensitive keywords (lots of them)

I get the error “reached heap limit allocation failed - javascript heap out of memory” when running lezer-generator on a grammar where I need to add lots of keywords (nearly a thousand).

I followed the advice in this post to add support for case-insensitive keywords.

But the problem started when I added lots of keywords. Unfortunately, the language needs to support all of them.

Here is part of the grammar:

MyFunctions {
  onefunction "(" expression ")" |
  anotherfunction "(" expression ")"
  // 500 more...
}

@external extend { Identifier } keywords from "./tokens" {
  // Functions
  onefunction[@name=Function], anotherfunction[@name=Function]
  // 500 more...

  // Operators
  precedes[@name=CompareOp], follows[@name=CompareOp]
}

In tokens.js:

// Look up the lowercased identifier; return its term ID, or -1 for no match.
export function keywords(name) {
  let found = keywordMap[name.toLowerCase()]
  return found == null ? -1 : found
}
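Here keywordMap is just a plain object mapping each lowercased keyword to the term ID it should produce. A minimal self-contained sketch, with made-up term IDs (in the real setup they come from the parser terms file generated by lezer-generator):

```javascript
// Illustrative term IDs; in a real project these would be imported from
// the parser terms file that lezer-generator emits.
const terms = { Function: 1, CompareOp: 2 };

// Every lowercased keyword mapped to the term it should extend.
const keywordMap = {
  onefunction: terms.Function,
  anotherfunction: terms.Function,
  precedes: terms.CompareOp,
  follows: terms.CompareOp,
};

// Same lookup as the keywords() function above; -1 means "no match".
function keywords(name) {
  const found = keywordMap[name.toLowerCase()];
  return found == null ? -1 : found;
}

console.log(keywords("PRECEDES")); // 2 (CompareOp)
console.log(keywords("nosuch"));   // -1
```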

It works fine as long as I have a limited number of keywords. As soon as I add all the keywords that the internal DSL needs to support, I get the heap out-of-memory error when I run lezer-generator.

Is there another way to add case-insensitivity support for the keywords, perhaps one where I don’t need to define all of them in the grammar file?

Any help would be appreciated. Thanks

I suspect the problem is that you have a ton of rules like onefunction "(" expression ")" for all those tokens. That’ll make the grammar big enough that trying to compile it exhausts Node’s heap space. Is it possible to combine all the keywords that have a similar role in the grammar into a single token type? That would probably help a lot.
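In grammar terms, that could look something like the sketch below, where a single FunctionName token replaces the ~500 per-keyword rules (FunctionName is an illustrative name):

```
FunctionCall { FunctionName "(" expression ")" }

@external extend { Identifier } keywords from "./tokens" {
  FunctionName
}
```

Here keywords() in ./tokens would return the FunctionName term for any recognized, lowercased function name.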

Yes that should work. Thanks a lot for the solution.

I tried increasing Node’s heap size to 16GB, but I still encounter the out-of-memory issue. I wonder if there might be a bug or a memory leak causing it to require such an excessive amount of memory.

Did you make the change to your grammar that I suggested? Very complicated grammars generate an unworkable number of parse states, which will exhaust your process memory without any memory leak being involved.

I tried implementing the solution you suggested by using a single rule for similar functions. However, I still need to check that the function names are valid, as only a specific set of functions is allowed.

The function names are also case-insensitive, so I tried using an external tokenizer to handle that. However, it’s still not matching. When I debugged the external function, I noticed that by the time it gets invoked, the next token is “(”, meaning the function name has already been missed. So I am traversing backwards using peek().

In my .grammar file I have

FunctionCall {
	FunctionWithNoParam { fnNameFunctionWithNoParam "(" ")" } |
	FunctionWithOneOptionalParam { fnNameFunctionWithOneOptionalParam "(" Expression? ")" }
	// ...and so on
}

@external tokens fnName from "./tokens" { fnNameFunctionWithNoParam }

In token.ts I have defined:

import { ExternalTokenizer } from "@lezer/lr";
import * as terms from "./parser.terms";

export const fnName = new ExternalTokenizer(
  (input) => {
    let name = "";
    let i = -1;
    let prev = input.peek(i);
    // Walk backwards from the current position to recover the name
    // the parser has already moved past.
    while (![SPACE, LINE_FEED, EOF].includes(prev)) {
      name = String.fromCharCode(prev) + name;
      prev = input.peek(--i);
    }
    if (knownFunctionNames.has(name.trim().toLowerCase())) {
      input.advance();
      input.acceptToken(terms.fnNameFunctionWithNoParam);
    }
  },
  { contextual: true, fallback: true },
);

But it’s not working, as I can’t find a way to make acceptToken accept a token that starts before the current position.

I am wondering whether this is the correct approach for validating function names. Is there a better way to achieve this?
Any help would be greatly appreciated. Thank you!

Ahh I think an easier solution might be to validate the function names in the linter extension I have defined. Instead of handling validation in the grammar, simply use an Identifier in the function rules:

FunctionCall {
    FunctionWithNoParam { Identifier "(" ")" } |
    FunctionWithOneOptionalParam { Identifier "(" Expression? ")" }
    // ...and so on
}

After parsing, walk through the syntax tree and validate the function names for each FunctionCall node.
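A minimal sketch of that per-node check, assuming a knownFunctionNames set with illustrative entries; the wiring that walks the tree (e.g. syntaxTree(state).iterate in a CodeMirror linter extension) is elided:

```javascript
// Allowed function names, compared case-insensitively (illustrative entries).
const knownFunctionNames = new Set(["onefunction", "anotherfunction"]);

// Check one FunctionCall's identifier text and position; return a lint
// diagnostic object for an unknown name, or null when the name is valid.
function checkFunctionName(name, from, to) {
  if (knownFunctionNames.has(name.toLowerCase())) return null;
  return { from, to, severity: "error", message: `Unknown function "${name}"` };
}

// The linter would call this for each FunctionCall node found while
// iterating the syntax tree, collecting the non-null diagnostics.
console.log(checkFunctionName("OneFunction", 0, 11)); // null (valid, case-insensitive)
console.log(checkFunctionName("badFn", 20, 25).message);
```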