Overlapping tokens in Lezer

Hi,

I’m new to Lezer and to writing grammars. I want to write a simple grammar that parses a basic assembler dialect (for educational purposes).

Here’s an example of the test I’d like to work:

# Lines

symbol1: CMP
symbol2:
MOV

==>

Program(
  Line(Symbol Command),
  Line(Symbol),
  Line(Command)
)

There are symbols, which are marked by a trailing colon, and commands. (More will come later.) A line can contain a symbol followed by a command, or just one of the two.

I’ve written this grammar for this:

@top Program { (lineWithNewline | emptyLine)* lineWithoutNewline?  }

@skip { space | LineComment }

lineWithNewline { Line newLine }
lineWithoutNewline { Line }
emptyLine { newLine }

Line {
  (symbolDeclaration Command)
  | symbolDeclaration
  | Command
}

symbolDeclaration {
  Symbol ":"
}

@tokens {
  Command { $[a-zA-Z]+ }

  Symbol { $[a-zA-Z0-9_$]+ }

  LineComment { "#" ![\n]* }

  space { $[ \t\r]+ }

  newLine { "\n" }
}

When compiling the grammar, I get this error: Overlapping tokens Symbol and Command used in same context (example: "A")

I understand that the problem is that symbolDeclaration and Command are ambiguous within Line. What I don’t understand is how to insert the ambiguity marker ~ in this case so that the parser evaluates both options in parallel.

I’ve tried putting the : into Symbol like this:

  Symbol { $[a-zA-Z0-9_$]+ ":" }
  @precedence { Symbol, Command }

It then works, but I want to keep the symbol declaration marker (:) separate from the token.

Can anyone help me out? I know this is probably a noob question, but I’m stuck here.

Sebastian

The GLR ~ operator works at the parser level, not the tokenizer level, so it is probably not a solution here. One option is to parse symbols with an external tokenizer that has higher precedence than the built-in tokenizer (that is, one declared before it in the grammar file). It would scan both the symbol and the colon, but return a token containing only the symbol. Alternatively, you could define a single identifier token and use it in both the Symbol and Command rules, which lets you disambiguate them at the grammar level. I think you don’t even need ~ operators in that case, just a precedence, similar to how the JavaScript grammar distinguishes labels from variable names.
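
For reference, the grammar side of the external tokenizer option might look roughly like the fragment below. This is only a sketch. The file name "./tokens" and the export name symbolTokenizer are hypothetical, and the tokenizer itself (an ExternalTokenizer from @lezer/lr, written in JavaScript) is not shown. The rest of the original grammar is assumed to stay as it is.

```
// Declared before @tokens so that it is tried before the built-in tokenizer.
// "./tokens" and symbolTokenizer are placeholder names for this sketch.
// The tokenizer is expected to read identifier characters, check that a ":"
// follows, and then emit a Symbol token that ends before the colon.
@external tokens symbolTokenizer from "./tokens" { Symbol }

symbolDeclaration {
  Symbol ":"
}

@tokens {
  // Symbol is no longer defined here, so it cannot overlap with Command.
  Command { $[a-zA-Z]+ }

  LineComment { "#" ![\n]* }

  space { $[ \t\r]+ }

  newLine { "\n" }
}
```

When the external tokenizer does not find a colon after the identifier, it produces no token and the built-in tokenizer gets its turn, which is what would let a bare command such as MOV still be read as Command.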

Thanks a lot for the quick reply! I’ve gone with the second option and am now using a generic identifier token.

Here’s my new grammar:

@top Program { (lineWithNewline | emptyLine)* lineWithoutNewline?  }

@skip { space | LineComment }

lineWithNewline { Line newLine }
lineWithoutNewline { Line }
emptyLine { newLine }

Line {
  (SymbolDeclaration Command)
  | SymbolDeclaration
  | Command
}

SymbolDeclaration {
  Symbol ":"
}

Symbol {
  identifier
}

Command {
  identifier
}

@tokens {
  identifier { $[a-zA-Z0-9_$]+ }

  LineComment { "#" ![\n]* }

  space { $[ \t\r]+ }

  newLine { "\n" }
}
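
One thing to note about the grammar above: in Lezer, rules whose names start with a capital letter appear as nodes in the syntax tree, so SymbolDeclaration now wraps each Symbol, and the expected tree from the original test (Line(Symbol Command)) would presumably no longer match as written. If the original tree shape is what is wanted, a minimal sketch is to keep that rule lowercase, as in the first version. Everything else is assumed unchanged.

```
Line {
  (symbolDeclaration Command)
  | symbolDeclaration
  | Command
}

// Lowercase rule name: no node is created for it in the tree,
// so Symbol becomes a direct child of Line again.
symbolDeclaration {
  Symbol ":"
}
```

Either spelling should parse the same input. The only difference is whether the declaration gets its own node in the resulting tree.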