Generation of parser from grammar is slow

I’m in the process of writing a SurrealQL parser for Lezer, and I have some recursion in the SELECT statement. It brings the JS generation step almost to a stop, and the cost seems to grow exponentially: even a single added recursion makes the required time much longer.

Here’s a link to a GitHub gist with the relevant code, mainly lines 185–200 (SelectStatement). This code compiles on my machine in 50 seconds; commenting out the timeout clause (line 198) brings it down to 27 s. Some parts of the code are commented out so I don’t have to wait too long for it…

This seems odd to me, as I was mainly working from the JS parser, which looks much more complicated than this one, so why is mine so slow? Am I doing the recursion part wrong?

Also, a general review would be appreciated, as this is my first time writing a grammar; I’m working from just the (incomplete) SurrealQL docs, the JS grammar file, and the Lezer docs.

The way the ? and | operators are implemented (by generating multiple permutations of the surrounding rule) means that a rule like your SelectStatement, with a lot of ? operators, is going to generate a lot of alternatives (I count 11 of them, so at least 2048 alternatives). If I remove those, compilation becomes very fast. You can also factor parts of it into separate rules, like maybeValue { kw<"value">? }. I had a similar issue in the JS grammar when I added optional decorators to a whole bunch of rules, duplicating the variants generated for all of them, until I made a wrapper rule for that. Your situation is a bit harder to handle elegantly, but just splitting that rule in two or using some wrappers like that should help a lot.
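To illustrate the wrapper idea, here is a sketch in Lezer grammar notation. The clause names (Fields, WhereClause, and so on) are hypothetical stand-ins for whatever the real SelectStatement rule contains; the point is only the shape of the refactoring.

```
// Before (sketch): every `?` in one rule doubles the number of
// generated alternatives, so n optional parts yield 2^n variants.
// SelectStatement {
//   kw<"select"> kw<"value">? Fields WhereClause? GroupClause?
//   OrderClause? LimitClause? TimeoutClause?
// }

// After: each optional part lives in its own wrapper rule, so
// SelectStatement itself has a single alternative and each wrapper
// has only two. The total grows linearly instead of exponentially.
SelectStatement {
  kw<"select"> maybeValue Fields maybeWhere maybeGroup
  maybeOrder maybeLimit maybeTimeout
}

maybeValue   { kw<"value">? }
maybeWhere   { WhereClause? }
maybeGroup   { GroupClause? }
maybeOrder   { OrderClause? }
maybeLimit   { LimitClause? }
maybeTimeout { TimeoutClause? }
```

The trade-off is some extra (anonymous or skippable) nodes in the tree, but the generator no longer has to enumerate every combination of present and absent clauses inside one rule.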

It may be worth adding a warning to lezer-generator when a single rule generates a ridiculous number of alternatives.

Putting the big pile of comparison operators into a single token will also make the grammar smaller.
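Something along these lines, as a sketch: the operator list here is an illustrative subset, not the full SurrealQL set.

```
@tokens {
  // One token covering all the comparison operators, instead of a
  // separate token (and separate grammar alternatives) for each.
  CompareOp { "=" | "!=" | "==" | "<" | "<=" | ">" | ">=" }
}
```

A rule can then just reference CompareOp where it previously listed every operator, and the generated tables only need to distinguish one token type there.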


This commit adds a warning for this
