Specializer using another token for specialization instead of literals

pe.affenzeller · February 1, 2024, 4:05pm

Hi!

I’m trying to figure out a way to specialize a token, but couldn’t find anything that works for my case. I’d like to match durations consisting of a count and a unit and, if that doesn’t match, just get an identifier.

At the moment, my grammar looks like this:

@top Filter { expression* }

expression { Value }

Boolean { @specialize<Identifier, "true" | "false"> }

Value { Identifier | Boolean | Duration | String }

@tokens {
  space { @whitespace }
  newLine { '\n' }
  endOfValue { space | newLine | std.eof }
  escapedCharacter { "\\" _ }

  Duration { DurationCount DurationUnit endOfValue }
  DurationCount { @digit+ }
  DurationUnit { 'ns' | 'ms' | 's' | 'm' | 'h' | 'd' }

  // An Identifier can not start with '.' or ','
  Identifier { (escapedCharacter | $[A-Za-z0-9_-]) (escapedCharacter | $[A-Za-z0-9_.,-])* }
  String { '"' !["]* '"' }

  @precedence { space, Duration, Identifier }
}

The resulting tree works just fine, but I would like to have the Duration token split up into count and unit.
Once I move Duration out of the @tokens block to get both count and unit in the tree, the DurationCount and Identifier tokens overlap.

Using precedence between those two doesn’t work, because
“100ms” would match

{ Value: { Duration { DurationCount, DurationUnit } } }

as expected, but “100mss” would result in

{ Value: { Duration { DurationCount, DurationUnit, ⚠ }, Value { Identifier } } }

when it should be just a single Identifier.

What I’d need is something like “specialize the Identifier if it matches a duration, and give me DurationCount and DurationUnit in the tree”.
Something like this (which isn’t valid syntax):

Duration { @specialize<Identifier, { DurationCount DurationUnit }> DurationCount DurationUnit }

Is there something available I can use to achieve that?

best,
Peter

marijn · February 1, 2024, 4:30pm

Tokens cannot be split, and specialization happens per token. So you’ll probably have to settle for something like an external tokenizer that looks ahead to see whether there is a unit after a number, and produces a special token for the number in that case.
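
A minimal sketch of the lookahead check such a tokenizer could perform, written as a plain function over a string so it can be tried in isolation. The names `scanDurationCount`, `isEndOfValue`, and `UNITS` are hypothetical and not part of Lezer. The unit list and the end-of-value rule (whitespace or end of input) are taken from the grammar in the question, with `\s` assumed as an approximation of `@whitespace`.

```typescript
// Hypothetical helper, not Lezer API: the unit list mirrors DurationUnit
// from the grammar in the question.
const UNITS: string[] = ["ns", "ms", "s", "m", "h", "d"];

// Assumption: a value ends at whitespace or at the end of the input,
// mirroring the endOfValue token in the question.
function isEndOfValue(text: string, pos: number): boolean {
  return pos >= text.length || /\s/.test(text[pos]);
}

// Returns the offset just past the digits if `text` at `start` holds
// digits followed by a unit and then the end of the value; otherwise -1,
// meaning the input should be left to the Identifier token.
function scanDurationCount(text: string, start: number): number {
  let pos = start;
  while (pos < text.length && text[pos] >= "0" && text[pos] <= "9") pos++;
  if (pos === start) return -1; // no digits at all

  for (const unit of UNITS) {
    if (text.startsWith(unit, pos) && isEndOfValue(text, pos + unit.length)) {
      return pos; // only the count is consumed, the unit stays in the input
    }
  }
  return -1;
}
```

With this check, “100ms” yields a count ending at offset 3, while “100mss” is rejected and falls through to Identifier. In an actual grammar the same logic would presumably live inside an `ExternalTokenizer` from `@lezer/lr`, reading ahead with `input.peek` and emitting the count token with `input.acceptToken`; how the unit is then tokenized depends on the rest of the grammar.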

pe.affenzeller · February 2, 2024, 6:55am

Cool, that’s what I expected. Thanks for the quick response, it helps a lot in moving on with this topic!