Contextual tokenization assigning unexpected token

Hey! Here’s a very stripped-down grammar:

```
@precedence {
  ternaryOp @right
}

@top Source { expression }

expression { UnaryOperator | BinaryOperator | Integer | Call }

UnaryOperator { "&" expression }

BinaryOperator {
  expression !ternaryOp "//" expression |
  OperatorIdentifier "/" Integer
}

OperatorIdentifier { "/" }

Call { Identifier expression? }

@tokens {
  Integer { $[0-9] }
  Identifier { $[a-z] }
}
```

Consider the input `&//2`. It seems clear to me that it should tokenize `&`, and that the only valid, unambiguous continuation is `/` (OperatorIdentifier), `/`, `2` (Integer). However, it looks like `//` is tokenized as a whole and the parse fails, even though `//` is not valid in that context.

Now, if we change the seemingly unrelated `Call` definition so that `expression` is no longer optional, the parse succeeds.

The parser generator will not use contextual tokens in this situation. Because longer literals automatically take precedence over shorter ones, `/` versus `//` is not counted as a conflict, and no separate token groups are created.

There’s no really elegant solution to this, but you could move `//` into an external tokenizer. External tokenizers automatically get their own ‘group’ for the purposes of contextual tokenization.
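For reference, a minimal sketch of that approach. It assumes the `"//"` literal in `BinaryOperator` is replaced by a `DoubleSlash` token declared as `@external tokens doubleSlash from "./tokens" { DoubleSlash }`; the token and file names are invented for the example, while `ExternalTokenizer` and the input stream methods are the actual @lezer/lr API:

```js
// tokens.js — sketch of an external tokenizer for the "//" operator
import { ExternalTokenizer } from "@lezer/lr"
// parser.terms.js is generated by lezer-generator; DoubleSlash is
// the term id of the hypothetical external token.
import { DoubleSlash } from "./parser.terms.js"

const slash = 47 // character code of "/"

// External tokens form their own group, so this tokenizer is only
// consulted in states where DoubleSlash is a valid next token; after
// "&" in "&//2" it stays silent and the plain "/" literal applies.
export const doubleSlash = new ExternalTokenizer(input => {
  if (input.next == slash && input.peek(1) == slash)
    input.acceptToken(DoubleSlash, 2)
})
```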

Thank you! In that case I will use an external tokenizer : ) My main surprise was that the optional part of `Call` made a difference.

A sidenote question: from an efficiency perspective, is there a preference between using a single tokenizer with `stack.canShift` checks as opposed to separate tokenizers? My intuition is that if a tokenizer runs only in places where its token is valid, then multiple granular tokenizers may be better, unless we know they run in the same context and can reuse work. Does that make sense?

It shouldn’t make much difference unless the actual work the tokenizers do overlaps—say, if multiple tokens have to scan indentation depth or past identifiers or something.
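To make the comparison concrete, here is roughly what the two shapes look like. The token names and scanning predicates are placeholders invented for the example; `stack.canShift` is the actual @lezer/lr `Stack` method:

```js
import { ExternalTokenizer } from "@lezer/lr"
// Hypothetical term ids from the generated parser.terms.js.
import { TokenA, TokenB } from "./parser.terms.js"

// Stand-in scanning logic; a real tokenizer would do actual work here.
const chA = 97, chB = 98 // character codes for "a" and "b"
const looksLikeA = input => input.next == chA
const looksLikeB = input => input.next == chB

// Shape 1: a single tokenizer that asks the parse stack which token
// it may emit in the current state.
export const combined = new ExternalTokenizer((input, stack) => {
  if (looksLikeA(input) && stack.canShift(TokenA)) input.acceptToken(TokenA, 1)
  else if (looksLikeB(input) && stack.canShift(TokenB)) input.acceptToken(TokenB, 1)
})

// Shape 2: separate tokenizers; each is only invoked in states where
// its own token applies, so no canShift check is needed.
export const tokenA = new ExternalTokenizer(input => {
  if (looksLikeA(input)) input.acceptToken(TokenA, 1)
})
export const tokenB = new ExternalTokenizer(input => {
  if (looksLikeB(input)) input.acceptToken(TokenB, 1)
})
```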
