Inconsistent skip set after Lezer generator upgrade

Hello Lezerians :wave:,

At nextjournal we鈥檙e trying to upgrade lezer-clojure to Lezer 1.0.0 (Upgrade to Lezer system 1.0.0 by zampino 路 Pull Request #18 路 lezer-parser/clojure 路 GitHub) from lezer 0.13.4. After straightforward migrations in code we end up with an error of the Inconsistent Skip Sets family. I can reproduce the issue in the following minimal grammar:

@top Program { expression+ }
@skip { whitespace | Skipped }
Skipped { "#_" expression }

expression { A | B | ReaderTag | Constructor }
A { "a" }
B { "b" }

@skip{}{
  ReaderTag { "#" ident ~amb }
  prefix { "#" ident ~amb }
}

Constructor { prefix B }

@tokens {
  "a" "b"
  ident { "*" }
  whitespace { std.whitespace+ }
}
@detectDelim

The grammar builds with lezer-generator src/mini.grammar -o src/mini and the tests below are passing under lezer-generator v0.13.4

import {parser} from './src/mini.js'
let p = parser.configure({strict: true})
// import {testTree} from "@lezer/generator/dist/test"
import {testTree} from "lezer-generator/dist/test"

testTree(p.parse("a"),           "Program(A)")
testTree(p.parse("a #_ a"),      "Program(A, Skipped(A))")
testTree(p.parse("a #_ #_ b b"), "Program(A, Skipped(Skipped(B), B))")
testTree(p.parse("a #* a"),      "Program(A, ReaderTag, A)")
testTree(p.parse("a #* b"),      "Program(A, Constructor( B ))")

but the grammar is not building in @lezer/generator 1.0.0 with error

Inconsistent skip sets after "#_" "#" ident

where the offending expression is Constructor: if we remove it from the valid expressions, we can actually generate a parser and tests pass except for the Constructor one of course.

Any hint at what could have changed or how to work-around this?

I鈥檓 not sure which change causes this, but it seems like this is the result of the way the Skipped production doesn鈥檛, after seeing "#" ident, know whether it is finished (it may be after a prefix, waiting for B, or done, with a ReaderTag), and thus can鈥檛 know whether it should match the top level skip rules at this point鈥攊f it鈥檚 done, it shouldn鈥檛, since we don鈥檛 want skip rules to go and consume stuff outside of their extent, if it鈥檚 still parsing a Constructor it should. As such, this seems like it is probably the result of a bugfix. You generally want skip expressions to have a definite end point, so that they unambiguously know where to finish (reduce).

Thank you @marijn for the clarification! Added some tests in favour of your argument

testTree(p.parse("a #_ #* a"),   "Program(A, Skipped(ReaderTag), A)")
testTree(p.parse("a #_ #* b"),   "Program(A, Skipped(ReaderTag), B)")

these both pass (v0.13.4) in non-strict mode, but throw a no-parse at runtime otherwise.

We might revisit things around Constructor and ReaderTag and don鈥檛 treat ReaderTag as an expression by itself. If we still want a safer skip maybe we should use an external tokenizer based on a clojure reader I guess.