Overlapping tokens in lezer

Hi,

I’m new to Lezer and writing grammars. What I want to do is write a simple grammar that can parse a basic assembler dialect (for educational purposes).

Here’s an example of the test I’d like to work:

# Lines

symbol1: CMP
symbol2:
MOV

==>

Program(
  Line(Symbol Command),
  Line(Symbol),
  Line(Command)
)

There are symbols, which are marked by a colon after them, and commands. (More will come later.) A line can either have a symbol and a command or one of them.

I’ve written this grammar for this:

@top Program { (lineWithNewline | emptyLine)* lineWithoutNewline?  }

@skip { space | LineComment }

lineWithNewline { Line newLine }
lineWithoutNewline { Line }
emptyLine { newLine }

Line {
  (symbolDeclaration Command)
  | symbolDeclaration
  | Command
}

symbolDeclaration {
  Symbol ":"
}

@tokens {
  Command { $[a-zA-Z]+ }

  Symbol { $[a-zA-Z0-9_$]+ }

  LineComment { "#" ![\n]* }

  space { $[ \t\r]+ }

  newLine { "\n" }
}

When compiling the grammar, I get this error: Overlapping tokens Symbol and Command used in same context (example: "A")

I understand that the problem is that in Line the symbolDeclaration and Command are ambiguous. I don’t understand how I can insert the ambiguity marker ~ in this case to let the parser evaluate both possible options in parallel.

I’ve tried putting the : into Symbol like this:

  Symbol { $[a-zA-Z0-9_$]+ ":" }
  @precedence { Symbol, Command }

It then works, but I want to keep the symbol declaration marker (:) separate from the token.

Can anyone help me out? I know this is probably a noob question, but I’m stuck here.

Sebastian

The GLR ~ operator works on the parser level, not the tokenizer level, so that is probably not a solution here. What you could do is parse symbols with an external tokenizer that has higher precedence than the built-in tokenizer (appearing before it in the grammar file), which scans both the symbol and the colon, but then returns a token containing only the symbol. Or you could define a single identifier token that you use in both the Symbol and Command rules, allowing you to disambiguate them at the grammar level (I think you don’t even need ~ operators in this case, just a precedence, similar to how the JavaScript grammar distinguishes labels and variable names).
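The first suggestion can be sketched roughly as follows. This is a plain-JavaScript illustration of the lookahead logic only (the function name and the whitespace handling before the colon are my own assumptions, not from the thread); in a real grammar this logic would be wrapped in `new ExternalTokenizer(...)` from @lezer/lr, with the token declared via `@external tokens` before the `@tokens` block so it takes precedence over the built-in tokenizer.

```javascript
// Sketch of the external-tokenizer idea: scan an identifier, peek
// ahead for a ":", and only then report a token that covers just the
// identifier -- the colon stays outside the token. Returns the token
// length, or -1 when no Symbol should be emitted here.
function scanSymbol(text, pos) {
  const isIdChar = ch => /[a-zA-Z0-9_$]/.test(ch);
  let end = pos;
  while (end < text.length && isIdChar(text[end])) end++;
  if (end === pos) return -1; // no identifier at this position
  let after = end;
  // Assumption: allow whitespace between the identifier and the colon.
  while (after < text.length && (text[after] === " " || text[after] === "\t")) after++;
  return text[after] === ":" ? end - pos : -1; // length excludes the colon
}

scanSymbol("symbol1: CMP", 0); // 7 -- "symbol1" is a Symbol
scanSymbol("MOV", 0);          // -1 -- no colon follows, not a Symbol
```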

Thanks a lot for the quick reply! I’ve taken the second route and am using the generic identifier.

Here’s my new grammar:

@top Program { (lineWithNewline | emptyLine)* lineWithoutNewline?  }

@skip { space | LineComment }

lineWithNewline { Line newLine }
lineWithoutNewline { Line }
emptyLine { newLine }

Line {
  (SymbolDeclaration Command)
  | SymbolDeclaration
  | Command
}

SymbolDeclaration {
  Symbol ":"
}

Symbol {
  identifier
}

Command {
  identifier
}

@tokens {
  identifier { $[a-zA-Z0-9_$]+ }

  LineComment { "#" ![\n]* }

  space { $[ \t\r]+ }

  newLine { "\n" }
}

I think I have the same (or a similar) type of problem, but I don’t fully understand what’s going on with my observations or how @marijn’s response applies. Apologies up front if my lack of knowledge/experience in this field makes this a dumb question.
Take a simple grammar trying to parse either a Value or a Name=Value pair, where Names are alphabetic and Values are alphanumeric, i.e. (this doesn’t work):

@tokens {
  name { $[a-z]+ }
  value { $[a-z0-9]+ }
}
@top Arg { (Name "=" Value | Value) }
Name { name }
Value { value }

I understand that the name and value tokens overlap, and thought that since name is more restrictive it should have token precedence:

@precedence { name, value }

but in this case foo=bar failed to parse (foo=10 worked, as expected). My understanding of why this failed seemed to align with marijn’s response: bar was tokenized as a name token because of the precedence, so it failed to match the Value production, and once a token is tokenized it cannot be re-tokenized to a different type. Thinking I was clever and cared more about the grammar than the tokenizer, I changed Value to

Value { (name | value) }

which worked. But for some reason I can’t remember (my actual case was more complex than this very reduced form), I changed the token precedence to

@precedence { value, name }

To me, this meant that foo would now tokenize as a value token and would fail to match the Name production. But the results were exactly the same: foo was parsed as a Name, so its token type must have been name.

So this is where/why I’m confused: is a token allowed to be multiple types, and if so, why did the original

Value { value }

fail (bar could have had both name and value types at tokenize time)? Is lookahead at the = token playing the key role somehow?

This was made a lot more confusing by a bug that existed in @lezer/generator (fixed in version 0.15.4, which I just released). The way it is supposed to work is that you’ll just never get any name tokens if you give value precedence, and only get value tokens for things that start with numbers if you give name precedence: the tokenizer will pick the highest-precedence token that matches a given bit of input. So the way you’re setting up these tokens just isn’t going to do anything useful, since they can’t be meaningfully distinguished by looking at a stretch of input characters.
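To make that behavior concrete, here is a toy classifier (my own illustration, not Lezer’s actual automaton) modeling the fixed precedence handling for the name/value overlap, assuming @precedence { name, value }: among the patterns that can match a stretch of input, the one listed first wins, and a pattern that cannot match at all simply loses to one that can.

```javascript
// Toy model of @precedence { name, value } applied to an
// already-delimited stretch of input: patterns are tried in
// precedence order and the first match wins. This illustrates why,
// with name listed first, you never get value tokens for purely
// alphabetic input.
const precedence = [
  { type: "name",  re: /^[a-z]+$/ },
  { type: "value", re: /^[a-z0-9]+$/ },
];

function classify(tokenText) {
  for (const { type, re } of precedence)
    if (re.test(tokenText)) return type;
  return null; // matches no token at all
}

classify("bar"); // "name" -- so `Value { value }` can never match it
classify("10");  // "value" -- name cannot match a digit
```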

Ahhh, I didn’t even think this would be a bug, so I didn’t consider mentioning the version I was using (0.15.0). Thanks for taking the time to figure that part out and respond in general.

It did always feel like I was placing grammar rules in the tokenizer, so this new behavior (at least the parts I think I understand) makes much more sense to me. Is it correct then to say that, with the new 0.15.4 behavior, something like the following would work? (Here it feels like I’m still figuring out useful characteristics of the tokens that can be leveraged in the grammar, but the tokens themselves are still ‘dumb’.)

@tokens {
  word { $[a-zA-Z0-9]+ }
  wordFirstCharAlpha { $[a-z] word? }
  @precedence { wordFirstCharAlpha, word }
}
@top Arg { (Name "=" Value | Value) }
words { (word | wordFirstCharAlpha) }
identifier { wordFirstCharAlpha }
Name { identifier }
Value { words }
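As a quick sanity check of the idea outside Lezer (the regexes and helper names below are my own restatement of the token definitions above, not run through the generator): only alpha-first words qualify as a Name, while Value accepts both shapes, so foo=bar and foo=10 both fit Name "=" Value, and a bare 10 can only be a Value.

```javascript
// Mirrors the token shapes above: wordFirstCharAlpha starts with a
// lowercase letter, word is any alphanumeric run. parseArg applies
// the Arg rule by hand: Name "=" Value, or a lone Value.
const isWordFirstCharAlpha = s => /^[a-z][a-zA-Z0-9]*$/.test(s);
const isWord = s => /^[a-zA-Z0-9]+$/.test(s);

function parseArg(src) {
  const eq = src.indexOf("=");
  if (eq > -1) {
    const name = src.slice(0, eq), value = src.slice(eq + 1);
    return isWordFirstCharAlpha(name) && isWord(value)
      ? { name, value }
      : null; // e.g. a digit-first Name is rejected
  }
  return isWord(src) ? { value: src } : null;
}

parseArg("foo=bar"); // { name: "foo", value: "bar" }
parseArg("foo=10");  // { name: "foo", value: "10" }
parseArg("10");      // { value: "10" }
```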

At a glance, that looks reasonable, yes.