Contextual tokenization assigning unexpected token

Hey! Here’s a very stripped-down grammar:

```
@precedence {
  ternaryOp @right
}

@top Source { expression }

expression { UnaryOperator | BinaryOperator | Integer | Call }

UnaryOperator { "&" expression }

BinaryOperator {
  expression !ternaryOp "//" expression |
  OperatorIdentifier "/" Integer
}

OperatorIdentifier { "/" }

Call { Identifier expression? }

@tokens {
  Integer { $[0-9] }
  Identifier { $[a-z] }
}
```

Consider the input `&//2`. It seems clear to me that it should tokenize `&`, and that the only valid, unambiguous continuation is `/` (OperatorIdentifier), `/`, `2` (Integer). However, it looks like `//` is tokenized as a whole and the parse fails, even though `//` is not valid in that context.

Now, if we change the seemingly unrelated `Call` definition so that `expression` is no longer optional, the parse succeeds.

The parser generator will not use contextual tokens in this situation. Because longer literals automatically take precedence over shorter ones, `/` versus `//` is not counted as a conflict, and no separate token groups are created.

There’s no really elegant solution to this, but you could move `//` into an external tokenizer. External tokenizers automatically get their own ‘group’ for the purposes of contextual tokenization.
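For reference, a minimal sketch of that approach. It assumes the `"//"` literal in `BinaryOperator` is replaced by a `DoubleSlash` token declared as `@external tokens doubleSlash from "./tokens" { DoubleSlash }`; the token and file names are invented for the example, while `ExternalTokenizer` and the input stream methods are the actual @lezer/lr API:

```js
// tokens.js — sketch of an external tokenizer for the "//" operator
import { ExternalTokenizer } from "@lezer/lr"
// parser.terms.js is generated by lezer-generator; DoubleSlash is
// the term id of the hypothetical external token.
import { DoubleSlash } from "./parser.terms.js"

const slash = 47 // character code of "/"

// External tokens form their own group, so this tokenizer is only
// consulted in states where DoubleSlash is a valid next token; after
// "&" in "&//2" it stays silent and the plain "/" literal applies.
export const doubleSlash = new ExternalTokenizer(input => {
  if (input.next == slash && input.peek(1) == slash)
    input.acceptToken(DoubleSlash, 2)
})
```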

Thank you! In that case I will use an external tokenizer : ) My main surprise was that the optional part of `Call` made a difference.

A sidenote question: from an efficiency perspective, is there a preference between using a single tokenizer with `stack.canShift` checks as opposed to separate tokenizers? My intuition is that if a tokenizer runs only in places where its token is valid, then multiple granular tokenizers may be better, unless we know they run in the same context and can reuse work. Does that make sense?

It shouldn’t make much difference unless the actual work the tokenizers do overlaps—say, if multiple tokens have to scan indentation depth or past identifiers or something.
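To make the comparison concrete, here is roughly what the two shapes look like. The token names and scanning predicates are placeholders invented for the example; `stack.canShift` is the actual @lezer/lr `Stack` method:

```js
import { ExternalTokenizer } from "@lezer/lr"
// Hypothetical term ids from the generated parser.terms.js.
import { TokenA, TokenB } from "./parser.terms.js"

// Stand-in scanning logic; a real tokenizer would do actual work here.
const chA = 97, chB = 98 // character codes for "a" and "b"
const looksLikeA = input => input.next == chA
const looksLikeB = input => input.next == chB

// Shape 1: a single tokenizer that asks the parse stack which token
// it may emit in the current state.
export const combined = new ExternalTokenizer((input, stack) => {
  if (looksLikeA(input) && stack.canShift(TokenA)) input.acceptToken(TokenA, 1)
  else if (looksLikeB(input) && stack.canShift(TokenB)) input.acceptToken(TokenB, 1)
})

// Shape 2: separate tokenizers; each is only invoked in states where
// its own token applies, so no canShift check is needed.
export const tokenA = new ExternalTokenizer(input => {
  if (looksLikeA(input)) input.acceptToken(TokenA, 1)
})
export const tokenB = new ExternalTokenizer(input => {
  if (looksLikeB(input)) input.acceptToken(TokenB, 1)
})
```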
