Token / nonterminal ambiguity

I’m working on a Scribble-like language. Programs are essentially plain text mixed with commands that need to be specially parsed. For example, this program:

The @emph{quick @bold{red}} fox {jumped}

should be parsed as something like:

["The ", 
 ["cmd:emph", "quick ", ["cmd:bold", "red"]], 
 " fox {jumped}"]

I’m trying to figure out the idiomatic way to implement this parser in Lezer. My first attempt was the following:

@top Document { Token* }
Token { Text | Command }
Command { "@" Ident ("{" Text "}")? }
@tokens {
  Text { ![@]+ }
  Ident { $[a-zA-Z_$]$[a-zA-Z0-9_$]* }
}

However this gives me the error:

Overlapping tokens "{" and Text used in same context (example: "{")

The ambiguity is that for a command like @a{b}, the curly braces {b} could be parsed either as part of the command or as part of the next token. The behavior I’d like is: if the curly braces can be associated with a command, they should be; otherwise they should be treated as plain text.
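To pin down that behavior independently of Lezer, here is a minimal hand-written sketch in TypeScript. It is not a Lezer grammar, only a reference for the intended output. The names Tree, parseDocument and parseItems are made up for illustration, and it assumes (beyond what is stated above) that a bare } inside a command body always closes that body.

```typescript
// A parsed item: plain text, or a command as ["cmd:<name>", ...children].
type Tree = string | Tree[];

// Same identifier shape as the Ident token; sticky so it matches at lastIndex only.
const IDENT = /[a-zA-Z_$][a-zA-Z0-9_$]*/y;

function parseDocument(src: string): Tree[] {
  const [items] = parseItems(src, 0, false);
  return items;
}

// Parse text and commands starting at `pos`. Inside a command body (`inBody`)
// a "}" ends the run (assumption); at top level braces are ordinary text.
function parseItems(src: string, pos: number, inBody: boolean): [Tree[], number] {
  const items: Tree[] = [];
  let text = "";
  const flush = () => {
    if (text) {
      items.push(text);
      text = "";
    }
  };
  while (pos < src.length) {
    const ch = src[pos];
    if (inBody && ch === "}") break;
    if (ch === "@") {
      IDENT.lastIndex = pos + 1;
      const m = IDENT.exec(src);
      if (m) {
        flush();
        pos += 1 + m[0].length;
        const cmd: Tree[] = ["cmd:" + m[0]];
        // A "{" directly after the identifier always belongs to the command.
        if (src[pos] === "{") {
          const [body, end] = parseItems(src, pos + 1, true);
          cmd.push(...body);
          pos = end < src.length ? end + 1 : end; // skip the closing "}"
        }
        items.push(cmd);
        continue;
      }
    }
    // Anything else, including an "@" with no identifier after it, is text.
    text += ch;
    pos++;
  }
  flush();
  return [items, pos];
}
```

With this sketch, parseDocument("@a {b}") treats {b} as plain text, because the brace does not directly follow the identifier.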

What’s the best way to address this problem in Lezer?

  • The precedence mechanisms don’t work because the conflict is between a token (Text) and a non-terminal (Command). As far as I can tell, Lezer only allows you to distinguish token vs. token or non-terminal vs. non-terminal.
  • Ideally I don’t want to change Text into ![{}@]+, because curly braces are generally allowable in text. This language will have several kinds of command sigils (@, %, #) and several kinds of delimiters (parens, brackets, braces). If none of those were permitted in text (or all had to be escaped), it would be limiting.
  • Maybe the answer is a context-sensitive external tokenizer? Couldn’t find a way to make that work.

The ? after the braced body is what’s causing this—the parser doesn’t know whether to start a body or parse plain text after an @-identifier.

I don’t really see a way to have } allowed in Text without introducing an ambiguity—except maybe by having a different text token for top-level text (which would be the only place where } has no special meaning).

Custom tokenizers might help here (a special handler for { tokens that only kicks in when they can be shifted, or different forms of Ident depending on whether there’s an opening brace after the token, so that the grammar rule can deterministically say whether there’s going to be a body). But the grammar seems a bit poorly conceptualized, so far.
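As a rough illustration of the second idea (two forms of Ident), the lookahead itself is small. The sketch below is plain TypeScript over a string, not Lezer's external tokenizer interface; the names scanIdent, IdentWithBody and IdentBare are hypothetical, and wiring this into an actual external tokenizer is left out. The token ends before the brace; the brace is only peeked at to pick the token kind.

```typescript
// Hypothetical token kinds: an identifier that is followed by "{" and one that is not.
type IdentKind = "IdentWithBody" | "IdentBare";

const isIdentStart = (c: string) => /[a-zA-Z_$]/.test(c);
const isIdentPart = (c: string) => /[a-zA-Z0-9_$]/.test(c);

// Scan an identifier at `pos` and classify it by the character that follows.
// Returns null when no identifier starts at `pos`.
function scanIdent(src: string, pos: number): { kind: IdentKind; end: number } | null {
  if (pos >= src.length || !isIdentStart(src[pos])) return null;
  let end = pos + 1;
  while (end < src.length && isIdentPart(src[end])) end++;
  // Peek only: the "{" is not consumed as part of the identifier token.
  const kind: IdentKind = src[end] === "{" ? "IdentWithBody" : "IdentBare";
  return { kind, end };
}
```

A grammar could then require a body after one kind and forbid it after the other, so the optional ? disappears. This only settles the body-or-no-body decision; how } should behave inside Text is a separate question.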
