Change recovery distance

Krupybalu · November 25, 2020, 9:51am

I am using Lezer to add syntax highlighting and linting to CodeMirror 6. For some kinds of errors, the default recoverDist is not enough to find the best recovery route. After manually increasing this variable, it finds the best route (without any noticeable slowdown).

Is there a way to override the default value from outside the module, so that it’ll still work after a fresh npm install?

marijn · November 25, 2020, 10:14am

The danger with a high recoverDist is in the behavior on nonsense input—did you benchmark how much things slow down if you feed your parser a completely unsyntactic document?

And can you provide a simplified example of the kind of error you’re trying to address?
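
For anyone wanting to run that kind of check, here is a minimal sketch of a nonsense-input benchmark. It is not part of Lezer: `ParseFn`, `makeNonsense`, and `benchParse` are hypothetical names, and `parse` stands in for whatever function wraps your generated parser. It records the input size next to the elapsed time so results from different runs can be compared.

```typescript
// Hypothetical stand-in for a function that runs your generated parser.
type ParseFn = (input: string) => unknown;

interface BenchResult {
  chars: number; // size of the generated input
  ms: number;    // wall-clock time spent in parse()
}

// Build a repeatable pseudo-random document that is unlikely to match any grammar.
function makeNonsense(length: number, seed: number = 1): string {
  const alphabet = "\"#:,{}()[]abcXYZ019 \t\n";
  let state = seed >>> 0;
  let out = "";
  for (let i = 0; i < length; i++) {
    // Linear congruential generator, so every run sees the same input.
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    out += alphabet.charAt(state % alphabet.length);
  }
  return out;
}

// Time one parse per requested input size.
function benchParse(parse: ParseFn, sizes: number[]): BenchResult[] {
  return sizes.map((chars) => {
    const input = makeNonsense(chars);
    const start = Date.now();
    parse(input);
    return { chars, ms: Date.now() - start };
  });
}
```

Running `benchParse(myParse, [1000, 10000, 100000])` once per recoverDist setting (with `myParse` being your own wrapper) would give a rough comparison. `Date.now()` only has millisecond resolution, so small inputs may read as 0 ms.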

Krupybalu · November 25, 2020, 11:05am

I’ve just run a few examples, and it takes 0.5 s at most to highlight and identify every error in a source from a completely different language (with recoverDist=50, although I don’t actually need such a high value).

I have a fairly complex language, but to simplify the problem: it has two different levels, and the upper level has all the necessary directives and keywords. The upper level is more important, so I want to make sure that it is correct. E.g.:

"stringcontent,
#if CONDITION
…
#endif

Because of the missing quotation mark, it will try to parse everything after the first quotation mark as a string. My problem is that the upper level is much more important, so I would like to parse it correctly and only throw an error in the second layer (like: " missing before #if).
With an increased recoverDist it finds the best recovery.

marijn · November 25, 2020, 12:30pm

Without an indication of file size, that doesn’t tell me much.

What do your string grammar rules look like? If they consist of multiple tokens and allow newlines, I’m confused about how the parser will notice, before the end of the file, that something is wrong.

Krupybalu · November 25, 2020, 1:13pm

The average code size is quite small, a few hundred lines.

Strings do not allow newlines; that’s why the parser can detect that something is wrong after the newline and start the error recovery process.
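
To illustrate why the newline is enough to notice the problem, here is a small hand-written sketch. It is not how Lezer's recovery works, and `findUnterminatedStrings` and `LineError` are hypothetical names. It assumes, as in my language, that strings cannot span lines and have no escape sequences, so a quote left open at the end of a line is an error.

```typescript
interface LineError {
  line: number;   // 1-based line number
  column: number; // 1-based column of the unmatched opening quote
  message: string;
}

// Assumption: strings cannot contain newlines or escaped quotes, so quotes
// on a line simply alternate between "open" and "close".
function findUnterminatedStrings(source: string): LineError[] {
  const errors: LineError[] = [];
  source.split("\n").forEach((text, index) => {
    let open = -1; // column of the currently open quote, or -1 if none
    for (let col = 0; col < text.length; col++) {
      if (text.charAt(col) === '"') {
        open = open < 0 ? col : -1;
      }
    }
    if (open >= 0) {
      errors.push({
        line: index + 1,
        column: open + 1,
        message: "string opened here is not closed before the end of the line",
      });
    }
  });
  return errors;
}
```

For an input such as `'#if condition\n"name: value,\n#endIf'`, this reports a single error at line 2, column 1, and leaves the `#if` and `#endIf` lines alone, which is the outcome I want from the parser's recovery.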

marijn · November 26, 2020, 11:29am

You could work around that by making a newline terminate a string in your grammar, I guess.

But this is actually something recovery should handle—all it needs to do is skip out of the string rule, which should be one of the first things it tries. Could you reduce your grammar to the minimum set of rules needed to reproduce this behavior for me, so that I can debug why recovery isn’t doing the right thing?

Krupybalu · November 27, 2020, 1:17pm

Exactly. I think the ambiguity markers may cause the possible paths to be too deep and the recovery function can’t get to the best routes.

Sure, this is the best I could come up with:

@top ROOT { MainBlock }

MainBlock { SecondLayerBlock "\n" | If }

If { IfKW identifier "\n" MainBlock EndIfKW}
SecondLayerBlock { ( Declaration | ~a "\n")* ~a }

Declaration { String ":" identifier ","}
String { '"' char* '"' }

@skip { limited_whitespace }

@tokens{
IfKW {'#if'}
EndIfKW {'#endIf'}
identifier { $[a-zA-Z_] $[a-zA-Z0-9_]*}
limited_whitespace {$[ \r\t] }

char { $[\u{20}\u{21}\u{23}-\u{5b}\u{5d}-\u{10ffff}] }

@precedence { identifier, limited_whitespace }
@precedence { char, limited_whitespace }
}

With the following input:

Correct:

#if condition
"name": value,
#endIf

Incorrect:

#if condition
"name: value,
#endIf

If I reduce the SecondLayerBlock so that it does not have ambiguities, then Lezer can recover the best possible tree in the incorrect example. Unfortunately, I have to somehow deal with empty lines in the SecondLayerBlock, but I can’t declare that in a separate skip block, because the original language has recursive calls from the SecondLayerBlock to the MainBlock and I have to explicitly look for newline characters in the MainBlock.

If I increase recoverDist to 10, then Lezer can recover the best tree from the incorrect example even with the ambiguous grammar.

Krupybalu · November 28, 2020, 3:07pm

A few corrections:

@precedence { identifier, limited_whitespace } is not needed; I left it in accidentally.

From recoverDist = 8 (and up), it can recover the best route.

marijn · December 2, 2020, 11:14am

Hm, if I clean up (?) the grammar to remove the ambiguity (which seems unnecessary), the problem goes away. Here’s what I ended up with:

@top ROOT { MainBlock }

MainBlock { SecondLayerBlock | If }

If { IfKW identifier "\n" MainBlock EndIfKW}

SecondLayerBlock { (Declaration? "\n")* }

Declaration { String ":" identifier ","}
@skip {} {
  String { "\"" char* "\"" }
}

@skip { limited_whitespace }

@tokens {
  IfKW {"#if"}
  EndIfKW {"#endIf"}

  identifier { $[a-zA-Z_] $[a-zA-Z0-9_]*}
  limited_whitespace { $[ \r\t]+ }

  char { $[\u{20}\u{21}\u{23}-\u{5b}\u{5d}-\u{10ffff}] }
}

I put a @skip {} block around the string rule, since in there it shouldn’t skip anything (which allows me to also remove the second @precedence declaration), and I moved the trailing newline into SecondLayerBlock to get rid of the conflict. Did I break the meaning of the grammar that way?

Krupybalu · December 2, 2020, 12:00pm

The string part of the cleanup does not break anything.
The newline part does break it though. Declarations are allowed to be in the same line, but in this grammar at least one newline character must be between them.

A few other notes (these are handled correctly in the cleaned-up grammar):

Multiple newline characters can follow each other, and this must be processed by the SecondLayerBlock.

At least one newline character must come before the If nonterminal.

I forgot to mention it in the grammar, but an If must end with a newline character.

marijn · December 2, 2020, 12:37pm

"Declarations are allowed to be in the same line"

Ah, right. Would changing the SecondLayerBlock to (Declaration* "\n")+ solve these? It sounds like your language should be expressible without ambiguities.

Krupybalu · December 2, 2020, 4:17pm

Unfortunately, this also breaks the original grammar.
The problem is that the SecondLayerBlock is used elsewhere as well, and the trailing newline is optional in that rule.

So in a MainBlock, the SecondLayerBlock must end with a newline character, while in other rules it can end with something else.