Unexpected syntax tree when adding a third character to the value of a comparison

pe.affenzeller · December 18, 2023, 4:31pm

hi!

I built a grammar to parse a simple filter syntax. there’s not much to it, but in certain cases I get an unexpected result and I don’t understand why. maybe someone could help me figure it out.

I’m using the lezer playground app to check the lezer syntax tree.
the grammar is defined like this:

@top Filter { expression* }

expression { FilterGroup | FilterStatement | LogicalOperator | space }

FilterGroup { "(" (FilterStatement | LogicalOperator | FilterGroup | space)+ ")" }

FilterStatement { FilterKey space? (comparison | inclusion) }

comparison { ComparisonOperator space? FilterSimpleValue }
// comparison { (ComparisonOperator | InvalidComparisonOperator) space? FilterSimpleValue }
inclusion { Inclusion space? FilterIncludesValue }

FilterKey { Identifier | String }

FilterSimpleValue[@isGroup=FilterValue] { ((PositionOperator | LikeOperator)? (Identifier | String) PositionOperator?) }

FilterIncludesValue[@isGroup=FilterValue] {
  List | IncompleteList
}

@tokens {
  space { @whitespace }

  ComparisonOperator { '=' | '!=' | '<' | '<=' | '>' | '>=' }
  InvalidComparisonOperator { $[A-Za-z0-9!@#$%^&*?,_\\.-/]+ space }
  LogicalOperator { 'and' | 'AND' | 'or' | 'OR' }
  Inclusion { 'in' | '!in' | 'IN' | '!IN'}
  PositionOperator { "*" }
  LikeOperator { "?" }

  // Account for empty pair of brackets to not immediately start a new group, but wait
  // for a range / list to be finished inside a filter statement.
  // Range { "(" (Identifier space? ("to" | "TO") space? Identifier) ")" }
  List { ("(" space* ")" | "(" space? (Identifier | String) ("," space? (Identifier | String))* space? ")") }

  Identifier { $[A-Za-z0-9_] $[A-Za-z0-9_.-]* }
  String { '"' !["]* '"' }

  IncompleteRange { "(" ((space? Identifier?) | (space? Identifier space?)) (($[tT]+$[oO]?)? | ("to" | "TO") space? Identifier?) ")"? }
  IncompleteList { "(" ")"? }

  @precedence { LogicalOperator, Identifier }
  @precedence { Range, IncompleteRange }
  @precedence { List, IncompleteList }
  @precedence { Inclusion, InvalidComparisonOperator }
}

valid input is parsed as expected. to be able to show proper error messages, I also tested some invalid statements and got unexpected results.

entering

foo , ba

I get a result that is at least somewhat expected: one filter statement with a key, two errors, and a value.

changing the input to

foo , bar

all of a sudden, I get two filter statements, and I have no idea why there should be a difference between ba and bar. as far as I can see, the number of characters should not matter for the filter value.

any input would be highly appreciated

best,
peter

marijn · December 19, 2023, 7:45am

Error recovery can be affected by small things like that, and is not something you can generally make many assumptions about.

pe.affenzeller · December 19, 2023, 10:32am

ok, so would you recommend enabling the currently commented-out InvalidComparisonOperator alternative to match invalid operators and get predictable results? or can you think of another option that might help?

marijn · December 19, 2023, 10:52am

Adding explicit rules that match invalid input can indeed help make the output more predictable. No, there are no options that change the way built-in error recovery works.
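
As a rough, untested sketch of that suggestion: assuming the InvalidComparisonOperator token and the `@precedence { Inclusion, InvalidComparisonOperator }` line from the grammar in the first post stay as they are, the commented-out `comparison` alternative could replace the active one. Whether the generator accepts this without further token conflicts has not been checked here.

```
// Accept either a valid operator or the catch-all token, so that an input like
// "foo , bar" is matched by an explicit rule instead of being left to error recovery.
// InvalidComparisonOperator is the token already defined in the @tokens block above.
comparison { (ComparisonOperator | InvalidComparisonOperator) space? FilterSimpleValue }
```

Since the token name is capitalized, it should show up as its own node in the tree, so code that produces error messages could look for InvalidComparisonOperator nodes rather than relying on where the error nodes happen to end up.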