Need help on reduce/reduce ambiguity

hlolli · October 25, 2022, 12:21pm

Hi all, I spent few days now stuck on getting grammar rules to match a language grammar I’m working on. The language is called Csound and it has some unconventional ways it can make a function statement. It can in fact call a function with or without parenthesis, and it support comma seperated arguments as input and output parameters of a function call. And this is where Lezer is tripping me out, even with ambiguity char ~ I still can’t get this to work, so Im hoping to tap into some expert advice here.

In Csound, a function has the archaic name Opcode, which I’ll use in this example.

You can call Opcode within an instr block like this with arguments ending with a newline that terminates the statement.

instr 1
sum 1, 2, 3
endin

this is easy enough and will work with grammar like this (note that Im leaving out some rules to make it shorter), note as well that variables in csound I’m calling SignalRateIdentifier, and those for example are the only identifier type allowed in opcode outputs.

@top Program { rootstatement* }

@skip { BlockComment | LineComment | space }

rootstatement {
  InstrumentDeclaration | newline
}

statement {
  OpcodeStatement
}

OpcodeStatement {
  Opcode expressionCommaSep newline
}

opcodeOutputExpression { commaSep<SignalRateIdentifier> }

expressionNoComma {
  Number |
  String |
  SignalRateIdentifier |
  ParenthesizedExpression { "(" expressionNoComma ")" } |
  BinaryExpression {
    expressionNoComma !arithOpR arithOpRight expressionNoComma |
    expressionNoComma !arithOpL arithOpLeft expressionNoComma
  } |
  ConditionalExpression { 
    expressionNoComma !ternary "?" expressionNoComma ":" expressionNoComma 
  } |
  CallbackExpression { Opcode !call ArgList }
}

expressionCommaSep { commaSep< expressionNoComma> }

InstrumentDeclaration {
  kw<"instr"> (FunctionName | Number) newline
  statement*
  InstrumentEnd
}

InstrumentEnd {
  kw<"endin">
}

FunctionName { word }

SignalRateIdentifier { identifier }

Opcode { identifier }

kw<term> { @specialize[@name={term}]<identifier, term> }

@tokens {

  LineComment { ";" ![\n]* | "//" ![\n]* }

  BlockComment { "/*" blockCommentRest }

  blockCommentRest { ![*] blockCommentRest | "*" blockCommentAfterStar }

  blockCommentAfterStar { "/" | "*" blockCommentAfterStar | ![/*] blockCommentRest }

  identifierChar { @asciiLetter | $[_$\u{a1}-\u{10ffff}] }

  Number { ("+" | "-")? @digit? "."? @digit+ |  ("+" | "-")? @digit+ "."? @digit* }

  word { identifierChar (identifierChar | @digit)* }

  identifier { word (":" $[iak])?  }

  @precedence { spaces, newline, identifier }

  @precedence { spaces, newline, opcodeIdentifier }

  @precedence { spaces, newline, word }

  newline { $[\n\r]+ }
  space { $[ \t]+ }

So here comes the ambiguity that I’ve been stuck on, csound allows for the following grammar, which trigger reduce/reduce errors in Lezer.

instr 1
signalRate opcode
signalRate1, signalRate2 opcode 1, 2, opcode(1 + 2)
opcode signalRate1, signalRate2, (1 + (2 * 3))
opcode
endin

My emphasis is on the statement rule itself, of opcode statement with either input and output, just either input or output, and a single opcode statement without input or output arguments.

This is my attempt to catch this pattern which fails, and maybe this is not even possible?

OpcodeStatement {
  opcodeOutputExpression Opcode expressionCommaSep newline |
  Opcode expressionCommaSep newline |
  opcodeOutputExpression Opcode newline |
  Opcode newline
}

the error usually complains about SignalRateIdentifier and Opcode conflicting with identifier token. So I try applying ~ operator, but that seems to confuse Lezer even more and it starts to give even worse results.

There’s a lot of context here but I’d be happy with even tiniest hint or tip!

marijn · October 25, 2022, 3:37pm

That grammar is incomplete, so I can’t compile it. But at a glance, it seems like the problem is that when the parser sees an identifier followed by an opening paren, it doesn’t know whether to reduce SignalRateIdentifier or Opcode. Try to think in terms of a parser that has to make decisions on what to do next with only one token of lookahead, and structure your grammar in such a way that these decisions can be made.

hlolli · October 26, 2022, 12:07am

Here’s a complete example that fails for the reasons you correctly guessed. I would think that with one token lookahead, this would work. Given that the presence of a comma should give away the difference between Opcode and SignalRateIdentifier. But I may be overseeing something simple.

@dialects { csd }

@top Program { rootstatement* }

@skip { BlockComment | LineComment | space }

@precedence {
  comma @left,
  arithOpL @left,
  arithOpR @right,
  opcodeargs @left,
  call,
  ternary @left
  
}

rootstatement {
  InstrumentDeclaration | newline
}

statement {
  OpcodeStatement | 
  AssignStatement |
  GotoStatement |
  ControlFlowStatement
}

GotoStatement {
  GotoLabel { identifier ":" } newline
}


InstrumentDeclaration {
  kw<"instr"> (FunctionName | Number) newline
  statement*
  InstrumentEnd
}

InstrumentEnd {
  kw<"endin">
}

AssignStatement {
  SignalRateIdentifier AssignOp expressionNoComma newline
}

expressionNoComma {
  Number |
  String |
  SignalRateIdentifier |
  ParenthesizedExpression { "(" expressionNoComma ")" } |
  BinaryExpression {
    expressionNoComma !arithOpR arithOpRight expressionNoComma |
    expressionNoComma !arithOpL arithOpLeft expressionNoComma
  } |
  ConditionalExpression { 
    expressionNoComma !ternary "?" expressionNoComma ":" expressionNoComma 
  } |
  CallbackExpression { Opcode !call ArgList }
}

expressionCommaSep { commaSep<expressionNoComma> }

opcodeOutputExpression {
  commaSep<SignalRateIdentifier>
}

ArgList {
  "(" commaSep<expressionNoComma> ")" |
  "(" ")"
}

OpcodeStatement {
  opcodeOutputExpression Opcode !opcodeargs expressionCommaSep newline |
  Opcode !opcodeargs expressionCommaSep newline
}

ControlFlowStatement {
  controlFlowBeginToken expressionNoComma controlFlowDoToken newline
  statement*
  controlFlowEndToken |
  controlFlowBeginToken expressionNoComma controlFlowGotoToken
  DeclareGotoLabel { word } newline
}

FunctionName { word }

SignalRateIdentifier { identifier }

Opcode { identifier }

AssignOp {
  kw<"init"> | assigners
}

@tokens {

  LineComment { ";" ![\n]* | "//" ![\n]* }

  BlockComment { "/*" blockCommentRest }

  blockCommentRest { ![*] blockCommentRest | "*" blockCommentAfterStar }

  blockCommentAfterStar { "/" | "*" blockCommentAfterStar | ![/*] blockCommentRest }

  identifierChar { @asciiLetter | $[_$\u{a1}-\u{10ffff}] }

  Number { ("+" | "-")? @digit? "."? @digit+ |  ("+" | "-")? @digit+ "."? @digit* }

  word { identifierChar (identifierChar | @digit)* }

  identifier { word (":" $[iak])? }

  @precedence { spaces, newline, identifier }

  @precedence { spaces, newline, opcodeIdentifier }

  @precedence { spaces, newline, word }

  newline { $[\n\r]+ }
  space { $[ \t]+ }
  String { '"' (![\\\n"] | "\\" _)* '"'? }

  assigners {
    "="  |
    "+=" |
    "-=" |
    "*=" |
    "/=" |
    "|=" |
    "&="
  }

  arithOpRight {
    "~" |
    "!" |
    "¬"
  }

  arithOpLeft {
    "/"  |
    "+"  |
    "-"  |
    "*"  |
    "%"  |
    "^"  |
    "#"  |
    "&"  |
    "|"  |
    "&&" |
    "||" |
    "<"  |
    ">"  |
    "<=" |
    ">=" |
    "==" |
    "!=" |
    ">>" |
    "<<"
  }

  @precedence { arithOpRight, arithOpLeft, assigners, Number, identifier }
  @precedence { BlockComment, LineComment, arithOpLeft }

  controlFlowBeginToken {
    "if" ![a-z] |
    "while"
  }

  controlFlowDoToken {
    "then" |
    "do"
  }

  controlFlowGotoToken {
    "goto" |
    "igoto" |
    "kgoto"
  }

  controlFlowEndToken {
    "endif" |
    "od"
  }

  @precedence { controlFlowBeginToken, controlFlowDoToken, controlFlowGotoToken, controlFlowEndToken, identifier }

}


// Keywords

kw<term> { @specialize[@name={term}]<identifier, term> }


commaSep<content> {
  content (!comma "," content?)*
}

@detectDelim

If I change

commaSep<content> {
  content (!comma "," content?)*
}

to

commaSep<content> {
  content (!comma "," content?)+
}

it will build, but it will produce wrong tree from the following example

instr 705
linseg iamp, iatt, p4, ilen, p4, idec, p7
amp1 linseg iamp, iatt, p4, ilen, p4, idec, p7
amp1, amp2, amp3 linseg iamp, iatt, p4, ilen, p4, idec, p7
endin

It gives me

Program [1:0..7:0]
 └─ InstrumentDeclaration [2:0..6:5]
     ├─ instr [2:0..2:5]: "instr"
     ├─ Number [2:6..2:9]: "705"
     ├─ OpcodeStatement [3:0..4:0]
     │   ├─ Opcode [3:0..3:6]: "linseg"
     │   ├─ SignalRateIdentifier [3:7..3:11]: "iamp"
     │   ├─ SignalRateIdentifier [3:13..3:17]: "iatt"
     │   ├─ SignalRateIdentifier [3:19..3:21]: "p4"
     │   ├─ SignalRateIdentifier [3:23..3:27]: "ilen"
     │   ├─ SignalRateIdentifier [3:29..3:31]: "p4"
     │   ├─ SignalRateIdentifier [3:33..3:37]: "idec"
     │   └─ SignalRateIdentifier [3:39..3:41]: "p7"
     ├─ OpcodeStatement [4:0..5:0]
     │   ├─ Opcode [4:0..4:4]: "amp1"
     │   ├─ SignalRateIdentifier [4:5..4:16]
     │   │   └─ ⚠ [4:12..4:16]: "iamp"
     │   ├─ SignalRateIdentifier [4:18..4:22]: "iatt"
     │   ├─ SignalRateIdentifier [4:24..4:26]: "p4"
     │   ├─ SignalRateIdentifier [4:28..4:32]: "ilen"
     │   ├─ SignalRateIdentifier [4:34..4:36]: "p4"
     │   ├─ SignalRateIdentifier [4:38..4:42]: "idec"
     │   └─ SignalRateIdentifier [4:44..4:46]: "p7"
     ├─ OpcodeStatement [5:0..5:17]
     │   ├─ SignalRateIdentifier [5:0..5:4]: "amp1"
     │   ├─ Opcode [5:6..5:10]: "amp2"
     │   ├─ ⚠ [5:10..5:11]: ","
     │   ├─ SignalRateIdentifier [5:12..5:17]: "amp3 "
     │   └─ ⚠ 5:17
     ├─ OpcodeStatement [5:17..6:0]
     │   ├─ Opcode [5:17..5:23]: "linseg"
     │   ├─ SignalRateIdentifier [5:24..5:28]: "iamp"
     │   ├─ SignalRateIdentifier [5:30..5:34]: "iatt"
     │   ├─ SignalRateIdentifier [5:36..5:38]: "p4"
     │   ├─ SignalRateIdentifier [5:40..5:44]: "ilen"
     │   ├─ SignalRateIdentifier [5:46..5:48]: "p4"
     │   ├─ SignalRateIdentifier [5:50..5:54]: "idec"
     │   └─ SignalRateIdentifier [5:56..5:58]: "p7"
     └─ InstrumentEnd [6:0..6:5]
         └─ endin [6:0..6:5]: "endin"

Where I want to get

Program [1:0..7:0]
 └─ InstrumentDeclaration [2:0..6:5]
     ├─ instr [2:0..2:5]: "instr"
     ├─ Number [2:6..2:9]: "705"
     ├─ OpcodeStatement [3:0..4:0]
     │   ├─ Opcode [3:0..3:6]: "linseg"
     │   ├─ SignalRateIdentifier [3:7..3:11]: "iamp"
     │   ├─ SignalRateIdentifier [3:13..3:17]: "iatt"
     │   ├─ SignalRateIdentifier [3:19..3:21]: "p4"
     │   ├─ SignalRateIdentifier [3:23..3:27]: "ilen"
     │   ├─ SignalRateIdentifier [3:29..3:31]: "p4"
     │   ├─ SignalRateIdentifier [3:33..3:37]: "idec"
     │   └─ SignalRateIdentifier [3:39..3:41]: "p7"
     ├─ OpcodeStatement [4:0..5:0]
     │   ├─ SignalRateIdentifier [4:0..4:4]: "amp1"
     │   ├─Opcode [4:5..4:16]: "linseg"
     │   |─-SignalRateIdentifier [4:12..4:16]: "iamp"
     │   ├─ SignalRateIdentifier [4:18..4:22]: "iatt"
     │   ├─ SignalRateIdentifier [4:24..4:26]: "p4"
     │   ├─ SignalRateIdentifier [4:28..4:32]: "ilen"
     │   ├─ SignalRateIdentifier [4:34..4:36]: "p4"
     │   ├─ SignalRateIdentifier [4:38..4:42]: "idec"
     │   └─ SignalRateIdentifier [4:44..4:46]: "p7"
     ├─ OpcodeStatement [5:0..5:17]
     │   ├─ SignalRateIdentifier [5:0..5:4]: "amp1"
     │   ├─ SignalRateIdentifier [5:6..5:10]: "amp2"
     │   ├─ SignalRateIdentifier [5:12..5:17]: "amp3"
     │   ├─ Opcode [5:17..5:23]: "linseg"
     │   ├─ SignalRateIdentifier [5:24..5:28]: "iamp"
     │   ├─ SignalRateIdentifier [5:30..5:34]: "iatt"
     │   ├─ SignalRateIdentifier [5:36..5:38]: "p4"
     │   ├─ SignalRateIdentifier [5:40..5:44]: "ilen"
     │   ├─ SignalRateIdentifier [5:46..5:48]: "p4"
     │   ├─ SignalRateIdentifier [5:50..5:54]: "idec"
     │   └─ SignalRateIdentifier [5:56..5:58]: "p7"
     └─ InstrumentEnd [6:0..6:5]
         └─ endin [6:0..6:5]: "endin"

marijn · October 26, 2022, 8:48am

I don’t think this is related to the comma. You can suppress it by requiring a comma after commaSep, but as you noticed, that isn’t actually what you want to express here.

Both AssignStatement and OpcodeStatement can start with two identifiers in a row, but in the former the first identifier must be reduced to SignalRateIdentifier, in the second to Opcode, and given that the token context looks the same, the parser cannot make the decision which one to reduce.

hlolli · October 26, 2022, 10:37am

@marijn

Related to AssignStatement, if I modify

statement {
  OpcodeStatement | 
  AssignStatement |
  GotoStatement |
  ControlFlowStatement
}

to

statement {
  OpcodeStatement
}

I still get

Error: reduce/reduce conflict between
  SignalRateIdentifier -> identifier
and
  Opcode -> identifier
With input:
  identifier/"instr" FunctionName newline identifier · identifier …
Shared origin: statement+ -> · OpcodeStatement
  via OpcodeStatement -> · commaSep<SignalRateIdentifier> Opcode commaSep<expressionNoComma> newline
    via commaSep<SignalRateIdentifier> -> · SignalRateIdentifier ("," SignalRateIdentifier?)+
      SignalRateIdentifier -> identifier ·
  via OpcodeStatement -> · Opcode commaSep<expressionNoComma> newline
    Opcode -> identifier ·

I understand that the tokens OpcodeStatement and SignalRateIdentifier can look the same. In theory it should always be possible to disambiguate between the two. I’m just missing a clever way to achieve it. I’m totally open for the idea that Lezer isn’t the right tool here to create codemirror6 syntax highlighting, in which case I can continue using the old-style callback tokinizer, but I hope I can figure this out with Lezer

marijn · October 26, 2022, 10:58am

You seem to have the same problem in OpcodeStatement – both of the alternatives can start with two identifiers: either a single opcodeOutputExpression followed by an OpCode (which will have to be reduced for a sequence in commaSep), or OpCode followed by expressionCommaSep. I’d say this stuff is probably parseable by an LR parser, but you’ll have to understand its limitations and structure the grammar around those. Or enable GLR parsing all over the place.

hlolli · October 26, 2022, 11:49am

It just dawned on me one obvious ambiguity

that for either single output or single input, both are valid in Csound

SignalRateIdentifier Opcode newline
Opcode SignalRateIdentifier newline

this is impossible for a parser to know which is which. I think it’s unavoidable that I tag these manually with the list of all known opcodes and mark them appropriately. I guess I could have the parser correctly guess some combination of input and output expressions, but I will tag these specifically as ambiguous and resolve the tag after parser has done its best job, and only look up the ambiguous parts specifically.

Thanks for your help, it’s getting more clear to me!