too many different token groups error

i keep hitting the error "Too many different token groups (17) to represent them as a 16-bit bitfield". i tried searching for it but got 0 results among lezer-related posts.

what is it exactly, and how do i avoid it?

here is roughly what my lezer grammar looks like:

@top File { Label* }

Label {
  LabelStart Expr* LabelEnd
}

Expr {
  Generic |
  Comment |
  String |
  ScaledFont |
  ScaledFontAt |
  ChangeAlphaNumeric |
  FieldOrigin |
  GraphicBox |
  FieldReverse |
  Code128BarCode |
  BarCodeFieldDef |
  FieldTypeset |
  Invalid
}

Comment {
  CommentStart CommentData
}
// etc...

@tokens {
  SC {
    "^"
  }
  CommentStart {
    SC "FX"
  }
  CommentData {
    ![\^]+
  }

  // etc...
}

The parser generator automatically detects ambiguous tokens (tokens that match the same input) and, if they aren’t used in the same context, implicitly distinguishes them by context. But this mechanism stores the set of contexts valid at a given parse position in a 16-bit bitfield, and your grammar somehow causes it to produce more than sixteen different contextual token groups (17, according to the error). This suggests you either have some highly ambiguous tokens, or you are using a bunch of different names for the same kind of token in different situations. If it’s the latter, consider using a single token type (possibly lower-cased) and wrapping it in nonterminals to tag it in a contextual way (e.g. VariableName { identifier }, TypeName { identifier }, etc).

yes, i do have the same tokens under different names, for example:

@tokens {

  F1Option {
    "Y" | "N"
  }

  F2Option {
    "Y" | "N"
  }

}

so i just need to unify them and make the token names lowercase?

i also have these kinds of tokens. are they considered ambiguous even though they did not cause an error initially?

BarCodeFieldDefModWidth {
  "10" | $[1-9] // match 1 - 10 ($[1-9] "0"? would also match 20, 30, ... 90)
}
BarCodeFieldDefRatio {
  ("2." $[0-9]) | "3.0" // match 2.0 - 3.0
}

Something like this, yes:

@tokens {
  yesNo { "Y" | "N" }
}

F1Option { yesNo }
F2Option { yesNo }

… or just use the same upper-case token for these directly, since it may not be worth much to have different node names.
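
As a rough sketch of that second option: the rule and token names below (F1Command, F2Command, f1Start, f2Start) are hypothetical and only there to show one upper-case token being referenced from two places.

```
@top File { (F1Command | F2Command)* }

// Hypothetical command rules; both reuse the same token,
// so the option shows up as a YesNo node in either command.
F1Command { f1Start YesNo }
F2Command { f2Start YesNo }

@tokens {
  // A single named token for every yes/no option.
  YesNo { "Y" | "N" }

  // Hypothetical command prefixes, assumed for illustration only.
  f1Start { "^F1" }
  f2Start { "^F2" }
}
```

The trade-off is that the tree no longer says which option a YesNo node is, but that can be read off the enclosing command node.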

In general, it’s a good idea not to duplicate tokens. The ModWidth/Ratio tokens could be defined unambiguously by stating that the Ratio always has precedence.
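
A minimal sketch of how that precedence could be stated, using a @precedence declaration inside the @tokens block (names listed earlier take precedence over later ones). ModWidth is written as "10" | $[1-9] here on the assumption that it should cover exactly 1 to 10.

```
@tokens {
  BarCodeFieldDefRatio {
    "2." $[0-9] | "3.0"  // 2.0 - 2.9, or 3.0
  }
  BarCodeFieldDefModWidth {
    "10" | $[1-9]        // 1 - 10 (assumed intent)
  }

  // Where both tokens could apply, prefer Ratio over ModWidth.
  @precedence { BarCodeFieldDefRatio, BarCodeFieldDefModWidth }
}
```

With the order made explicit, the generator should no longer need to tell the two tokens apart by context.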

ah ok, i think i'm starting to get it. i was thinking that the rules for the nodes should go into @tokens, but this is what i should have been writing all along:

@top File { Label* }

Label {
  LabelStart Expr* LabelEnd
}

Expr {
  Generic |
  Comment
}

LabelStart {
  sc "XA"
}
LabelEnd {
  sc "XZ"
}

GenericStart {
  (sc "B1") |
  (sc "B2") |
  (sc "B3") |
  (sc "B4") |
  (sc "B5") |
  (sc "B7") |
  // etc for all unimplemented commands
  
}
GenericData {
  exceptCaretNewLine*
}
Generic {
  GenericStart GenericData
}

CommentStart {
  sc "FX"
}
CommentData {
  exceptCaret*
}
Comment {
  CommentStart CommentData
}

@tokens {
  sc {
    "^"
  }

  st {
    "~"
  }

  c {
    ","
  }

  exceptCaretNewLine {
    ![\^\n~]
  }

  exceptCaret {
    ![\^~]
  }
}