getting stack overflow while stress testing very large input

shche · November 30, 2023, 10:51pm

Hello Marijn,

I have this grammar:

@top top { Expr }

@precedence { plus @left }
@skip { space }

Expr { Number | Sum }
Sum { Expr !plus "+" Expr }

@tokens {
  Number { @digit+ }
  space { @whitespace+ }
}

I parse this input:

`${'1+'.repeat(2000)}1`

(large amount of numbers with pluses in between).

I get a stack overflow exception RangeError: Maximum call stack size exceeded in @lezer/common (in tree.ts file in takeNode function that calls itself recursively).

Apparently this is because of the fact that Sum nodes are deeply nested. When I use a grammar that have something like Expr { (Number "+")* Number } there is no problem. The structure of the parse tree is different in this case, of course.

I suppose my question is whether exceptions like this are expected (after all, I can get a similar exception in Javascript, and it is not the language to blame, but me for writing bad code) and what can be done about that? In other words, grammars like this can be considered bad programming style, do you have any recommendations or comments about this?

thanks!
konstantin

marijn · December 1, 2023, 8:14am

Are you using @lezer/common 1.1.1? If not, does upgrading to that help?

shche · December 1, 2023, 12:28pm

Thank you. It did help partially.

I had to add

  "resolutions": {
    "@lezer/lr/@lezer/common": "^1.1.1"
  }

to package.json because my code does not use @lezer/common directly so it is not in my package.json, I use @lezer/lr which has a dependency on an earlier version of @lezer/common.

Now there is no stack overflow, but the resulting parse tree contains error nodes given sufficiently long input.

Specifically, if I use input like

const input = `${'1+'.repeat(100)}1`;

then the parse tree does not contain error nodes.
If I increase the length:

const input = `${'1+'.repeat(2000)}1`;

then I get error nodes in the tree.

marijn · December 1, 2023, 1:04pm

It depends on anything compatible with an earlier version. So if you just clear your package lock you should get the current version without adding a resolutions rule.

So yes, there is a limit to the depth of trees that can be handled by recursive functions without blowing the V8 stack (Spidermonkey has a much deeper stack limit). Both tree building and tree iteration, and many things user code might want to do with these trees, is done with recursive functions. So that’s why the tree building logic cuts off the depth at a point before those overflows happen.

shche · December 1, 2023, 7:34pm

Thank you very much for the help.

When a stack overflow exception is thrown, does the parser restart from the grammar start symbol or from its current state?

Also (I suppose this is also relevant to my other question at Why does this input generate no error?) is there a way to tell error nodes that indicate that the input does not match the grammar (when there is no transition in the table for some input token), from error nodes that are inserted in case of an exception or, as in the referenced post, when the parser restarts because there is more input?

marijn · December 2, 2023, 9:04am

In this case, the parser parses the content normally, and the flattening of the tree from a certain depth is happening in Tree.build. The parser itself also has a stack depth limit at which it stops going deeper, but because parse stack depth and tree depth don’t directly correspond, the tree builder has its own failsafe mechanism.

No actual stack overflow happens—the code just tracks recursion depth and stops at a depth that get close to the limit of current JS engines.

Error nodes created by depth-limits are not distinguishable from parse error nodes.

shche · December 2, 2023, 11:53am

If at some cut off depth you start flattening tree, what is the point of generating error nodes periodically? I see error nodes every so many nodes in the tree. Why not a single error node is generated and after that all nodes are flat?

marijn · December 2, 2023, 12:53pm

Then you may be seeing the depth limitation that the parser does after all—that one will abort a bunch of rules at the top of the stack, taking it back to some lower depth, and then continue parsing until it hits the limit again.

shche · December 3, 2023, 7:33pm

Thank you. As I understand, from a flattened tree I will not be able to deduce any helpful information about the structure of my input. Is this correct? If so, I am trying to understand how to write my client code… When I see an error node, usually I would highlight its range as incorrect and go on looking at the other nodes. But in this case, what do I do? How to tell that after a specific error node we have a flattened tree? If I could distinguish such error nodes, I would just ignore the rest of input and, say, not do any highlighting. The approach taken in the Lezer code is clear: you want to have degraded functionality even under low memory conditions. It is unclear to me how to use this flattened tree… Of course, this is unlikely that I will ever get to this point, my application will crash much earlier because of lack of memory, but we sometimes write deterministic code, so the logic must exist for all cases…

marijn · December 3, 2023, 8:05pm

Depending on how the language works, it would probably also require a truly pathological input to get to this point.

Or, if this is a construct that users of the language could conceivably repeat an enormous number of times, maybe it would be better to express it with a repeat operator (+ or *), rather than a recursive rule. That way, Lezer will build a balanced tree rather than a lopsided linked list structure.