too many different token groups error

ivanjx · October 21, 2024, 10:48am

I keep hitting the error "Too many different token groups (17) to represent them as a 16-bit bitfield". I tried searching for it but got 0 results among Lezer-related posts.

What is it exactly, and how can I avoid it?

Here is roughly what my Lezer grammar looks like:

@top File { Label* }

Label {
  LabelStart Expr* LabelEnd
}

Expr {
  Generic |
  Comment |
  String |
  ScaledFont |
  ScaledFontAt |
  ChangeAlphaNumeric |
  FieldOrigin |
  GraphicBox |
  FieldReverse |
  Code128BarCode |
  BarCodeFieldDef |
  FieldTypeset |
  Invalid
}

Comment {
  CommentStart CommentData
}
// etc...

@tokens {
  SC {
    "^"
  }
  CommentStart {
    SC "FX"
  }
  CommentData {
    ![\^]+
  }

  // etc...
}

marijn · October 21, 2024, 11:15am

The parser generator automatically detects ambiguous tokens (tokens that match the same input) and, if they aren’t used in the same context, implicitly distinguishes them by context. The mechanism that does this uses a 16-bit bitfield to store the set of contexts valid at a given parse position, and your grammar somehow causes it to produce more than sixteen different contextual token groups. This suggests you either have some highly ambiguous tokens, or you are using a bunch of different names for the same kind of token in different situations. If it’s the latter, consider using a single token type (possibly lower-cased) and wrapping it in nonterminals to tag it in a contextual way (e.g. VariableName { identifier }, TypeName { identifier }, etc).
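
As a rough illustration of the wrapping approach, here is a minimal sketch of a complete grammar for a made-up mini-language of "name : type ;" declarations. The rule names Program and Declaration, the space token, and the identifier pattern are all hypothetical and not taken from the grammar in question.

```lezer
@top Program { Declaration* }

// Hypothetical rule: the same token appears in two positions
Declaration { VariableName ":" TypeName ";" }

// Nonterminals that tag the shared token by context.
// They show up as separate node types in the syntax tree.
VariableName { identifier }
TypeName { identifier }

@skip { space }

@tokens {
  // One lower-case token (so it produces no node of its own)
  // instead of two differently named tokens with the same pattern
  identifier { $[a-zA-Z_]+ }
  space { $[ \t\n]+ }
}
```

Because the tokenizer only ever sees one identifier token here, there is nothing for it to disambiguate by context, so this pattern should not add to the number of contextual token groups.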

ivanjx · October 21, 2024, 11:27am

Yes, I do have identical tokens with different names, for example:

@tokens {

  F1Option {
    "Y" | "N"
  }

  F2Option {
    "Y" | "N"
  }

}

So I just need to unify them and make the token names lowercase?

ivanjx · October 21, 2024, 11:34am

I also have tokens like these. Are they considered ambiguous even though they didn't cause an error initially?

BarCodeFieldDefModWidth {
  $[1-9] "0"? // meant to match 1 - 10 (as written, also matches 20, 30, ..., 90)
}
BarCodeFieldDefRatio {
  ("2." $[0-9]) | "3.0" // match 2.0 - 3.0
}

marijn · October 21, 2024, 11:44am

Something like this, yes:

@tokens {
  yesNo { "Y" | "N" }
}

F1Option { yesNo }
F2Option { yesNo }

… or just use the same upper-case token for these directly, since it may not be worth much to have different node names.

In general, it’s a good idea to not duplicate tokens. The ModWidth/Ratio tokens could be defined unambiguously by stating that the Ratio always has precedence.
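
A minimal sketch of what that could look like, assuming both tokens are declared in the same @tokens block, with the token definitions copied from the question:

```lezer
@tokens {
  BarCodeFieldDefModWidth {
    $[1-9] "0"?
  }
  BarCodeFieldDefRatio {
    ("2." $[0-9]) | "3.0"
  }

  // Tokens listed earlier take precedence over those listed later,
  // so where both tokens could match (input starting with "2" or "3"),
  // the Ratio token is preferred.
  @precedence { BarCodeFieldDefRatio, BarCodeFieldDefModWidth }
}
```

With an explicit precedence the overlap is resolved inside the tokenizer itself, so the generator should not need to separate these two tokens by context.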

ivanjx · October 21, 2024, 12:03pm

Ah ok, I think I'm starting to get it. I was thinking that the rules for the nodes should go into @tokens, but this is what I should have been writing all along:

@top File { Label* }

Label {
  LabelStart Expr* LabelEnd
}

Expr {
  Generic |
  Comment
}

LabelStart {
  sc "XA"
}
LabelEnd {
  sc "XZ"
}

GenericStart {
  (sc "B1") |
  (sc "B2") |
  (sc "B3") |
  (sc "B4") |
  (sc "B5") |
  (sc "B7") |
  // etc for all unimplemented commands
  
}
GenericData {
  exceptCaretNewLine*
}
Generic {
  GenericStart GenericData
}

CommentStart {
  sc "FX"
}
CommentData {
  exceptCaret*
}
Comment {
  CommentStart CommentData
}

@tokens {
  sc {
    "^"
  }

  st {
    "~"
  }

  c {
    ","
  }

  exceptCaretNewLine {
    ![\^\n~]
  }

  exceptCaret {
    ![\^~]
  }
}