Question on disambiguation with undelimited sequences

Hi!

I am working on a personal language project with Lezer. While I’ve gotten something to work, it’s hacky and I was wondering about the preferred approach for expressing something like below.

variable : type = "value"  // -> Assign( VarName OptionalType String )
variable                   // -> ExprStmt(VarName)
                                   
variable with              // -> ExprStmt( VarName WithClause( 
  key1 : "value1"          //        Key Value(String) 
  key2 : "value2"          //        Key Value(String) 
  key3 : "value3"          //        Key Value(String) )


variable // -> ExprStmt(VarName)

The main problem is to disambiguate a variable name vs. a Key in the with clause sequence without using an end delimiter for with clause (also right now I’m skipping spaces and new lines if that matters). I had the grammar below, which has an identifier token, and a labeled rule which looks for !label Key{identifier} ":" Value{expression} after a with keyword. (The !label precedence marker is needed to resolve a shift-reduce conflict.) However, this will force succeeding identifiers to be a Key.

Grammar 1
@top Script { statement* }

statement {
  ExprStmt{ expression }  |
  Assign 
}

expression {
  String | 
  VarName{identifier} |
  expression 
     !withL WithClause{@specialize[@name=With]<identifier, "with"> labeled+ }
}
Assign {
  VarName{identifier} ( ':' OptionalType{ identifier})?
    "="  expression 
}

@precedence {
  withL @left
  label @left
}

// the label prec removes the shift/reduce conflict, but forces all succeeding identifiers to be a Key
labeled { !label Key{identifier} ':' Value{expression} }

@skip { space | newline }
@tokens {
  space { $[ \t]+ }
  newline { $[\r\n] }
  
  String { '"' (![\\\n"] | "\\" _)* '"' }
  
  identifierChar { @asciiLetter }
  identifier { identifierChar (identifierChar | @digit)* }  
}
variable with              // ExprStmt( VarName WithClause( 
  key1 : "value1"          //        Key Value(String) 
  key2 : "value2"          //        Key Value(String) 
  key3 : "value3"          //        Key Value(String) 

variable // -> still a Key instead of the intended variable name

Because of this, I added a label token, which is just identifier followed by ":", gave it a higher precedence than identifier, and used it in place of the identifier token in the with clause (snippet below, full grammar at the bottom of the post).

expression {
  String | VarName{identifier} |
  expression !withL WithClause{With labeled+ }
}

With{@specialize< identifier, "with">}
labeled { !label Key{label} Value{expression} } // <- use label token instead of identifier


@tokens {
  // ...
  identifier { identifierChar (identifierChar | @digit)* }  
  label { identifier space? ":"}
  @precedence {label, identifier}
}

This “works” for the with clause, but now the colon is inside the Key{label} node, and there are other side effects, such as needing additional handling for variable type declarations (I collapsed the details below if you’re interested). So I was wondering if there’s a cleaner way to accomplish this within the grammar, or whether I should resort to an external tokenizer (or just give up and introduce delimiters :D)? Thank you in advance!

variable with              // ExprStmt( VarName WithClause( 
   key1 : "value1"          //        Key Value(String) 
   key2 : "value2"          //        Key Value(String) 
   key3 : "value3"          //        Key Value(String) 
//^^^^^^^ - Key nodes now include the colon because it uses the label token
variable // -> But at least this is now an ExprStmt(VarName)
Side effect on variable type declaration

Using the label {identifier ":"} would interfere with something like a type declaration following a VarName{identifier} rule because the label token has a higher precedence than the identifier.

variable : optionalType = "string"
^^^^^^^^^^ - // this range is now  a label token, so VarName{identifier}
             //doesn't kick in

To handle that, I split variable assignment into two alternatives, one for the identifier token and one for the label token. This again works as intended, but it’s hackish, and the colon again becomes part of the VarName{label} node.

Assign {
  ( VarName{identifier} 
    |
    VarName{label} OptionalType{ identifier} )  
   "="  expression 
}
Grammar 2
@top Script { statement* }

@skip { space | newline }

@precedence {
  withL @left 
  label @left
}

statement {
  ExprStmt{ expression }  |
  Assign 
}

expression {
  String | VarName{identifier} |
  expression !withL WithClause{With labeled+ }
}


With{@specialize< identifier, "with">}
// use the label token for keys in the with clause instead of identifier
labeled { !label Key{label} Value{expression} } 

Assign {
  ( VarName{identifier} 
    |
    VarName{label} OptionalType{ identifier} )  
   "="  expression 
}

@tokens {
  space { $[ \t]+ }
  newline { $[\r\n] }
  
  String { '"' (![\\\n"] | "\\" _)* '"' }
  
  identifierChar { @asciiLetter }
  identifier { identifierChar (identifierChar | @digit)* }  

  label { identifier space? ":"} // <-- new label token
  @precedence {label, identifier}
}

Edit: I think I’ve now properly interpreted the shift-reduce conflict mentioned above, the one !label was meant to resolve, and have put ~with ambiguity markers in the (seemingly) correct places, where !label used to be. The grammar below now works without the special label token, but I would still love to hear your thoughts on this approach, especially on whether the ambiguity marker is even needed. Thank you!

New grammar with ambiguity markers instead of a label token and precedence marker
@top Script { statement* }

statement {
  ExprStmt{ expression }  |
  Assign 
}

expression {
  String | 
  VarName{identifier} |
  expression 
     ~with WithClause{@specialize[@name=With]<identifier, "with"> labeled+ }
}
Assign {
  VarName{identifier} ( ':' OptionalType{ identifier})?
    "="  expression 
}

@precedence {
  withL @left
  label @left
}

// ~with ambiguity markers now stand where the !label precedence marker used to be
labeled { ~with Key{identifier} ':' Value{expression} }

@skip { space | newline }
@tokens {
  space { $[ \t]+ }
  newline { $[\r\n] }
  
  String { '"' (![\\\n"] | "\\" _)* '"' }
  
  identifierChar { @asciiLetter }
  identifier { identifierChar (identifierChar | @digit)* }  
}

If this is a language you’re designing, parsing issues like this might be an indication that the syntax is ambiguous and needs more work. If there’s no indication that a with block ends except that subsequent syntax isn’t valid, that’s maybe not ideal?

In any case, yes, you can either use GLR parsing, or structure the grammar in such a way that it only needs to make a different shift or reduce at the point where the ambiguity ends, but in this case, that’d require variable : type and key : value to have the same internal parse structure, which may be weird.

Thanks for your input! Yeah, it’s supposed to be a small DSL grammar. It looks like Haskell has some similar syntax but uses indentation to resolve the ambiguities. A quick follow-up: if I were to go the custom label token route, can token rules referenced by other token rules appear in the output tree, as in @label{ KeyToken ":"}? It’s probably not intended, but perhaps I’m missing something too :sweat_smile:

Then you’ll want to encode that in your grammar, and break the conflict that way, no?

I don’t quite understand your other question. Tokens do appear in the tree, when their names start with a capital letter or they provide an explicit @name, regardless of how they are referenced.
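As a rough illustration of that naming rule, here is a minimal sketch. It is not taken from the grammars above: the Identifier and colon tokens, the Colon node name, and the tiny Script rule are all made up for the example.

```
@top Script { (Identifier | colon)* }

@skip { space }

@tokens {
  // lowercase name, no @name: matched, but produces no node in the tree
  space { $[ \t\r\n]+ }

  // capitalized name: shows up as an Identifier node wherever it is used
  Identifier { @asciiLetter (@asciiLetter | @digit)* }

  // lowercase name with an explicit @name: shows up as a Colon node
  colon[@name=Colon] { ":" }
}
```

The sketch only covers tokens used directly from nonterminal rules; it makes no claim about a token that is referenced from inside another token rule.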

Yeah, that’s probably the best way indeed. I guess I’m also just curious about Lezer in general :smiley: Regarding the other question: if I have a @tokens block like the one below, would it be possible for the identifier referenced inside label to be named and appear in the tree? Or is that only allowed in production rules?

@tokens {
  ...
  identifier { ... }  
  label { identifier space? ":"} 
  @precedence {label, identifier}
}

If the identifier type should be used for only a single node type in the tree, just call it Identifier and it’ll show up wherever it’s used. If you want to use it for different types of tree nodes, the easiest pattern is to define wrapping nonterminals (TypeName { identifier }, VariableName { identifier }, etc.) and use those in your rules.
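A minimal sketch of that wrapping-nonterminal pattern, assuming a cut-down version of the Assign rule from the question (the Script and Assign shapes here are simplified for illustration and are not the original poster's full grammar):

```
@top Script { Assign* }

// both wrappers share the same lowercase token but yield different node types
VariableName { identifier }
TypeName { identifier }

Assign {
  VariableName (":" TypeName)? "=" String
}

@skip { space }

@tokens {
  space { $[ \t\r\n]+ }
  String { '"' (![\\\n"] | "\\" _)* '"' }
  // lowercase, so the raw token itself does not appear in the tree
  identifier { @asciiLetter (@asciiLetter | @digit)* }
}
```

Because identifier stays lowercase, only the wrappers appear as nodes, so the colon and the raw token stay out of VariableName and TypeName, unlike with the combined label token.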

Got it! Thank you very much for your responses and your work on these amazing tools!