Precedence with @extend

I’m working on a grammar for Clojure and I would like to detect and highlight variable names. I’ve created a minimal grammar to illustrate the problem:

@top[name=Program] { expression* }

@skip { whitespace }
expression { Symbol | List }

List { "(" (DefLike VarName expression? | expression*) ")" }
VarName { Symbol }

@tokens {
"("
")"
whitespace { std.whitespace }
Symbol { std.asciiLetter+ }
}

DefLike { @extend<Symbol, "def" | "defn"> }

@detectDelim

A DefLike token may only appear at the beginning of a list, so I’m using both meanings.

Running this grammar against the following tests produces one failure:

# Add
(hello world)
==>
Program(List(Symbol,Symbol))

# Def 
(def foo bar)
==> Program(List(DefLike,VarName(Symbol),Symbol))

# Def Defn
(def defn foo)
==> Program(List(DefLike,VarName(Symbol),Symbol))

# Def Defn 2
(def defn foo bar)
==> Program(List(DefLike,VarName(Symbol),Symbol,Symbol))

Only the last test fails:

expression
    ✓ Add
    ✓ Def 
    ✓ Def Defn
    1) Def Defn 2


3 passing (7ms)
1 failing

1) expression
    Def Defn 2:
    Error: Expected DefLike in List, got Symbol at 1 
Program(List("(",Symbol,Symbol,Symbol,Symbol,")"))

I found it surprising to see the last test fail while the test before it passes. It seems that my grammar is not deterministic in what tree it produces. Notice that the only difference between the last two tests is the addition of a symbol at the end of the list.

Is there a better way to specify what I want without this ambiguity? I also tried an external tokenizer that inspects the stack so that a DefLike token is only considered at the beginning of a list, but I couldn’t get that to work.

Also wondering if lezer-generator should output a warning in this case. I only discovered this problem after using lezer interactively as my grammar tests were actually passing.

A running version of this minimal grammar can be found at https://github.com/nextjournal/lezer-clojure/tree/minimal-def

GLR (which you implicitly enable with @extend — as opposed to @specialize) will run multiple parses alongside each other and pick any one of them when it finishes at the end of the input.

But I’m not sure that’s the problem here. Your last input doesn’t match DefLike VarName expression?, so the output you get is the only reasonable one.

Indeed, I made a mistake when simplifying the grammar. I’ve changed it to DefLike VarName expression+ and now all tests pass, which means I’m struggling to reproduce the issue with just lezer.
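For reference, a sketch of the changed List rule, with the rest of the minimal grammar left as in the first post. The comments trace the failing input and are my own reading of the rule, not generator output:

```
// Before: DefLike VarName expression? accepts at most three items,
// so in (def defn foo bar) the trailing "bar" has nowhere to go and
// only the expression* branch matches (four plain Symbols).
// After: expression+ lets the def branch take "foo bar" as well.
List { "(" (DefLike VarName expression+ | expression*) ")" }
```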

It does however still occur on my interactive demo at https://nextjournal.github.io/codemirror.next-clojure/, here’s a gif of what I mean:
[GIF: CleanShot 2020-09-07 at 14.24.23]

Notice how writing the # seems to make lezer consider the alternative parse where it highlights the VarName. I have a passing test for this exact input.

Is this what I’m observing here? So both parses are valid and it doesn’t know which one should take precedence? If I use @specialize instead, the grammar fails to parse cases like (def defn …), where it finds a DefLike token in an invalid position.
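To make the @specialize case concrete, this is roughly the variant I mean. It is only a sketch against the minimal grammar from the first post, and the comment describes my understanding of why it fails:

```
// With @specialize, every Symbol spelled "def" or "defn" becomes DefLike
// and the plain Symbol reading is dropped. In (def defn foo) the second
// token is then DefLike, but VarName { Symbol } needs a Symbol there.
DefLike { @specialize<Symbol, "def" | "defn"> }
```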

Possibly. If you’re using lezer 0.10.2 and lezer-generator 0.10.2 you can add [dynamicPrecedence=1] to rules to give them priority when multiple parses match.
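A minimal sketch of what that could look like on the DefLike rule from the first post, assuming the prop goes in brackets after the rule name the same way [name=Program] does on @top. I believe later generator releases spell the prop @dynamicPrecedence, so treat the exact spelling as version-dependent:

```
// Prefer the parse that uses the extended (DefLike) meaning
// when more than one parse matches the input.
DefLike[dynamicPrecedence=1] { @extend<Symbol, "def" | "defn"> }
```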

That works. Thanks so much for your help!

I was a bit surprised that it was enough to put it on DefLike as opposed to defList. Would it be sensible to give precedence to “extended” meanings by default?