Understanding how rules are parsed

I’m experimenting with Lezer and would like to create a minimal grammar for expressions like:

  • {x} + {y} + {z}
  • {a var} + {another var}

In the grammar, a term like {a var} is a Variable and the string inside the curly braces is a VariableName. Note that VariableNames can contain whitespace, but cannot start or end with whitespace (i.e. {a var} is valid, but { a} and {var } are not). Also, whitespace between Variables should be skipped.

My attempt at this is as follows:

@precedence {
  add @left
}

@top Lang { Expression }

@skip { space }

Expression { (Variable | Expression) !add "+" (Variable | Expression) }

Variable { "{" VariableName "}" }

@tokens {
  space { std.whitespace+ }
  char { std.asciiLetter }
  VariableName { char+ (space char+)* }
}
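
To make the VariableName token rule concrete, here is a rough sketch of the same pattern as a regular expression. It assumes that \s is a close enough stand-in for std.whitespace and [A-Za-z] for std.asciiLetter; the helper name isValidVariableName is hypothetical and not part of Lezer.

```typescript
// Mirrors the token rule: char+ (space char+)*
// Assumption: \s approximates std.whitespace, [A-Za-z] approximates std.asciiLetter.
const variableNamePattern = /^[A-Za-z]+(?:\s+[A-Za-z]+)*$/;

// Hypothetical helper: true when the text would match the VariableName token.
function isValidVariableName(text: string): boolean {
  return variableNamePattern.test(text);
}

console.log(isValidVariableName("a var")); // true
console.log(isValidVariableName(" a"));    // false (leading whitespace)
console.log(isValidVariableName("var "));  // false (trailing whitespace)
```

Because every run of whitespace must be followed by at least one letter, the pattern can never start or end with whitespace.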

Using the resultant parser to parse the string a+{b} + {c d} produces the following tree:

Lang:
├╴⚠: 
└╴Expression:
  ├╴Expression:
  │ ├╴Variable:
  │ │ ├╴VariableName: a
  │ │ └╴⚠: 
  │ └╴Variable:
  │   └╴VariableName: b
  └╴Variable:
    └╴VariableName: c d

I’m scratching my head trying to understand why a is parsed as the VariableName child of a Variable parent when the curly braces are missing. I would have thought the segment a+ would be considered invalid and {b} + {c d} would be parsed as a valid expression.

The reason why this matters to me is because I’m trying to use the parser in CM6 and would like to use styleTags to highlight VariableNames within valid Variables, i.e. whenever they appear within curly brackets. But with a parser based on the above grammar and a styleTag node selector like Variable/VariableName, the term a is being highlighted in the string a+{b} + {c d}, even though it’s not the VariableName of a valid Variable.

There’s nothing else in your grammar that matches a variable name without curly braces around it, so the error-tolerant parsing kicks in.

Thanks, that’s helpful to know. So one way to make the parser “ignore” an invalid piece like a+ would be to remove VariableName from the grammar and just define Variable as a token, like this:

@precedence {
  add @left
}

@top Lang { Expression }

@skip { space }

Expression { (Variable | Expression) !add "+" (Variable | Expression) }

@tokens {
  space { std.whitespace+ }
  char { std.asciiLetter }
  Variable { "{" char+ (space char+)* "}" }
}

In this case, I lose the ability to specifically mark the content inside the curly braces using a styleTag; however, maybe I could use something like a range and a decorator to achieve that instead. Would this be the best solution?

The parser doesn’t really ignore anything, it’ll just emit even less useful nodes (just error nodes) for content like this. I’m not sure why you’d want to change your grammar in that way.

Is the example input an actually valid input in your language? If so, the way to clean this up would be to extend your grammar to recognize it.

The string a+{b} + {c d} is not valid, however the segment {b} + {c d} is valid. Ideally, if I were to type a+{b} + {c d} into my CM6 view, I’d like to:

  1. Highlight the variable names b and c d (e.g. in bold)
  2. Not highlight a because it’s not a valid variable name since it’s not in curly brackets
  3. Show some sort of error for the a+ segment

I’m trying to figure out if (1) and (2) can be done purely with the parser + styleTags. For (3) I assume I’ll need a linter.

You can set up styleTags to style Variable/VariableName if you only want to target variable name nodes that have a variable node as parent.

That’s what I thought as well, but it doesn’t seem to work as expected. I set up this minimal example in StackBlitz to illustrate - as you can see the initial a in a+{b} + {c d} gets highlighted despite the fact that the rule in lang.js is to style only Variable/VariableName.

I presume this is an issue with the grammar somehow, but just can’t figure out a good way to solve it.

Oh, right, it is creating a Variable node around it because that’s the only place a VariableName may occur in the tree.

This isn’t so much an issue with your grammar as an issue with your expectations of how highlighting will work. Lezer will build a tree covering as much of the input as it can manage, and highlighting makes no distinction for content that was parsed during error recovery (since usually, you’ll still want your tokens highlighted in an identifiable way, even if they aren’t syntactically valid there).
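
If the goal is still to treat recovered Variables differently, one possible approach is to post-process the tree rather than rely on styleTags alone: skip any Variable that has an error node as a direct child. The sketch below only illustrates that filtering idea on a plain object shaped like the tree from the first post. The SketchNode interface and cleanVariableNames function are hypothetical; a real Lezer tree would be walked with its cursor API, checking the node type's isError flag, and the result would presumably have to be applied through decorations in CM6.

```typescript
// Hypothetical simplified node shape, NOT the Lezer tree API.
interface SketchNode {
  name: string;
  isError?: boolean;       // stands in for an error (⚠) node
  text?: string;           // source text covered by a leaf node
  children?: SketchNode[];
}

// Collect VariableName texts whose parent Variable has no error child.
function cleanVariableNames(node: SketchNode, out: string[] = []): string[] {
  const children = node.children ?? [];
  if (node.name === "Variable" && !children.some((c) => c.isError)) {
    for (const child of children) {
      if (child.name === "VariableName" && child.text !== undefined) {
        out.push(child.text);
      }
    }
  }
  for (const child of children) cleanVariableNames(child, out);
  return out;
}

// The tree reported for a+{b} + {c d}, transcribed by hand.
const tree: SketchNode = {
  name: "Lang",
  children: [
    { name: "⚠", isError: true },
    {
      name: "Expression",
      children: [
        {
          name: "Expression",
          children: [
            {
              name: "Variable",
              children: [
                { name: "VariableName", text: "a" },
                { name: "⚠", isError: true },
              ],
            },
            { name: "Variable", children: [{ name: "VariableName", text: "b" }] },
          ],
        },
        { name: "Variable", children: [{ name: "VariableName", text: "c d" }] },
      ],
    },
  ],
};

const highlighted = cleanVariableNames(tree);
console.log(highlighted); // ["b", "c d"]
```

The a is dropped only because its Variable was produced by error recovery and therefore contains an error node; the grammar itself is unchanged.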
