This is my first serious attempt at implementing a Lezer grammar. I chose XPath 1.0 because, in later versions, XPath is effectively a subset of XQuery rather than a standalone language and seems to gain quite a bit of complexity as a result. Here is the grammar:
```
@top XPath { expr }

// XPath operator names such as "div" and "and" are not reserved words
// (//div is a valid path), so @extend is used instead of @specialize to keep
// the plain localName reading available alongside the keyword reading.
keyword<w> { @extend<localName, w> }

// Listed from highest to lowest precedence.
@precedence {
  invoke @left
  filter @left
  path @left
  union @left
  unary @right
  multiplicative @left
  additive @left
  relational @left
  equality @left
  and @left
  or @left
}

rootStep {
  Child { Root { '/' } !path step }
  | Descendant { Root { '//' } !path step }
}

step {
  AxisSpecified { AxisName { localName } '::' generalStep }
  | AttrSpecified { '@' generalStep }
  | generalStep
  | SelfStep { '.' }
  | ParentStep { '..' }
}

generalStep {
  NameTest
  | Invoke {
      FunctionName { name }
      !invoke
      ArgumentList {
        '(' ( expr ( ',' expr )* )? ')'
      }
    }
}

expr {
  VariableReference
  | '(' expr ')'
  | StringLiteral
  | NumberLiteral
  | rootStep
  | step
  | Child { expr !path '/' step }
  | Descendant { expr !path '//' step }
  | Filtered { expr !filter '[' expr ']' }
  | UnionExpr { expr !union '|' expr }
  | UnaryNegativeExpr { '-' !unary expr }
  | MultiplyExpr { expr !multiplicative '*' expr }
  | DivideExpr { expr !multiplicative keyword<'div'> expr }
  | ModulusExpr { expr !multiplicative keyword<'mod'> expr }
  | AddExpr { expr !additive '+' expr }
  | SubtractExpr { expr !additive '-' expr }
  | GreaterThanExpr { expr !relational '>' expr }
  | GreaterEqualExpr { expr !relational '>=' expr }
  | LessThanExpr { expr !relational '<' expr }
  | LessEqualExpr { expr !relational '<=' expr }
  | NotEqualsExpr { expr !equality '!=' expr }
  | EqualsExpr { expr !equality '=' expr }
  | AndExpr { expr !and keyword<'and'> expr }
  | OrExpr { expr !or keyword<'or'> expr }
}

NameTest {
  name
  | wildcard
  | qualifiedWildcard
}

name {
  localName
  | qualifiedName
}

@skip {
  whitespace
}

@tokens {
  whitespace { $[ \r\n\t]+ }

  // Character ranges follow the XML NameStartChar / NameChar productions, minus ':'.
  localNameStartChar {
    std.asciiLetter | "_"
    | $[\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C-\u200D]
    | $[\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\u{10000}-\u{EFFFF}]
  }

  localNameChar {
    localNameStartChar | "-" | "." | std.digit | $[\u00B7\u0300-\u036F\u203F-\u2040]
  }

  localName {
    localNameStartChar localNameChar*
  }

  qualifiedName {
    localName ':' localName
  }

  wildcard {
    '*'
  }

  qualifiedWildcard {
    localName ':' '*'
  }

  StringLiteral {
    '"' !["]* '"'
    | "'" ![']* "'"
  }

  NumberLiteral {
    @digit+ ('.' @digit*)?
    | '.' @digit+
  }

  VariableReference {
    '$' localName
    | '$' qualifiedName
  }

  @precedence {
    qualifiedWildcard qualifiedName localName
    NumberLiteral '.'
  }
}
```
I’ve simplified a few things. For example, instead of trying to disambiguate between a function call expression and a node type test in a path step (as in true() vs. comment()), both cases are covered by Invoke, with no validation of the name or arguments for node type tests. There is also no check that an axis name is valid, so the grammar will happily accept //my-completely-made-up-axis::*.
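If those checks are wanted, they could be done in a pass over the finished tree rather than in the grammar itself. The following is only a minimal sketch of that idea in TypeScript, not part of the grammar: the names isValidAxis and classifyInvoke are hypothetical, and it assumes the caller has already sliced the source text for the AxisName and FunctionName nodes and counted the expressions in the ArgumentList. The axis and node type names are the ones defined by XPath 1.0.

```typescript
// The thirteen axes defined by XPath 1.0.
const XPATH_AXES: ReadonlySet<string> = new Set([
  "ancestor", "ancestor-or-self", "attribute", "child", "descendant",
  "descendant-or-self", "following", "following-sibling", "namespace",
  "parent", "preceding", "preceding-sibling", "self",
]);

// The four XPath 1.0 node type names; these cannot be used as function names.
const NODE_TYPES: ReadonlySet<string> = new Set([
  "comment", "text", "processing-instruction", "node",
]);

type InvokeKind = "function-call" | "node-type-test" | "invalid-node-type-test";

// Hypothetical helper: true if the text of an AxisName node is a real axis.
function isValidAxis(name: string): boolean {
  return XPATH_AXES.has(name);
}

// Hypothetical helper: decide what an Invoke node represents, given the text
// of its FunctionName and the number of expressions in its ArgumentList.
// Only the argument count is checked here; processing-instruction() also
// requires its single optional argument to be a string literal.
function classifyInvoke(name: string, argCount: number): InvokeKind {
  if (!NODE_TYPES.has(name)) return "function-call";
  const maxArgs = name === "processing-instruction" ? 1 : 0;
  return argCount <= maxArgs ? "node-type-test" : "invalid-node-type-test";
}
```

With these, true() would be classified as a function call, comment() as a node type test, and comment(1) as an invalid node type test, while my-completely-made-up-axis would be rejected by isValidAxis.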
Any and all feedback is welcome, especially if you can see a case where a valid path wouldn’t be parsed correctly, but also on any problems with how the grammar is written from an efficiency (or even stylistic) perspective.
Thanks!