Inconsistent skip set after Lezer generator upgrade

zampino · June 28, 2022, 2:00pm

Hello Lezerians ,

At nextjournal we’re trying to upgrade lezer-clojure to Lezer 1.0.0 (Upgrade to Lezer system 1.0.0 by zampino · Pull Request #18 · lezer-parser/clojure · GitHub) from lezer 0.13.4. After straightforward migrations in code we end up with an error of the Inconsistent Skip Sets family. I can reproduce the issue in the following minimal grammar:

@top Program { expression+ }
@skip { whitespace | Skipped }
Skipped { "#_" expression }

expression { A | B | ReaderTag | Constructor }
A { "a" }
B { "b" }

@skip{}{
  ReaderTag { "#" ident ~amb }
  prefix { "#" ident ~amb }
}

Constructor { prefix B }

@tokens {
  "a" "b"
  ident { "*" }
  whitespace { std.whitespace+ }
}
@detectDelim

The grammar builds with lezer-generator src/mini.grammar -o src/mini and the tests below are passing under lezer-generator v0.13.4

import {parser} from './src/mini.js'
let p = parser.configure({strict: true})
// import {testTree} from "@lezer/generator/dist/test"
import {testTree} from "lezer-generator/dist/test"

testTree(p.parse("a"),           "Program(A)")
testTree(p.parse("a #_ a"),      "Program(A, Skipped(A))")
testTree(p.parse("a #_ #_ b b"), "Program(A, Skipped(Skipped(B), B))")
testTree(p.parse("a #* a"),      "Program(A, ReaderTag, A)")
testTree(p.parse("a #* b"),      "Program(A, Constructor( B ))")

but the grammar is not building in @lezer/generator 1.0.0 with error

Inconsistent skip sets after "#_" "#" ident

where the offending expression is Constructor: if we remove it from the valid expressions, we can actually generate a parser and tests pass except for the Constructor one of course.

Any hint at what could have changed or how to work-around this?

marijn · June 28, 2022, 8:47pm

I’m not sure which change causes this, but it seems like this is the result of the way the Skipped production doesn’t, after seeing "#" ident, know whether it is finished (it may be after a prefix, waiting for B, or done, with a ReaderTag), and thus can’t know whether it should match the top level skip rules at this point—if it’s done, it shouldn’t, since we don’t want skip rules to go and consume stuff outside of their extent, if it’s still parsing a Constructor it should. As such, this seems like it is probably the result of a bugfix. You generally want skip expressions to have a definite end point, so that they unambiguously know where to finish (reduce).

zampino · June 29, 2022, 8:11am

Thank you @marijn for the clarification! Added some tests in favour of your argument

testTree(p.parse("a #_ #* a"),   "Program(A, Skipped(ReaderTag), A)")
testTree(p.parse("a #_ #* b"),   "Program(A, Skipped(ReaderTag), B)")

these both pass (v0.13.4) in non-strict mode, but throw a no-parse at runtime otherwise.

We might revisit things around Constructor and ReaderTag and don’t treat ReaderTag as an expression by itself. If we still want a safer skip maybe we should use an external tokenizer based on a clojure reader I guess.