XPath 1.0 grammar [Seeking implementation critique]

duncanc · August 10, 2022, 11:30am

This is my first serious attempt at implementing a Lezer grammar. I chose XPath 1.0 because in later versions XPath is really a subset of XQuery rather than its own thing, and it seems to gain quite a bit of complexity as a result.

@top XPath { expr }

keyword<w> { @specialize<localName, w> }

@precedence {
  invoke @left
  filter @left
  path @left
  union @left
  unary @right
  multiplicative @left
  additive @left
  relational @left
  equality @left
  and @left
  or @left
}

rootStep {
  Child { Root { '/' } !path step }
  | Descendant { Root { '//' } !path step }
}

step {
  AxisSpecified { AxisName { localName } '::' generalStep }
  | AttrSpecified { '@' generalStep }
  | generalStep
  | SelfStep { '.' }
  | ParentStep { '..' }
}

generalStep {
  NameTest
  | Invoke {
    FunctionName { name }
    !invoke
    ArgumentList {
      '(' ( expr ( ',' expr )* )? ')'
    }
  }
}

expr {
  VariableReference
  | '(' expr ')'
  | StringLiteral
  | NumberLiteral
  | rootStep
  | step
  | Child { expr !path '/' step }
  | Descendant { expr !path '//' step }
  | Filtered { expr !filter '[' expr ']' }
  | UnionExpr { expr !union '|' expr }
  | UnaryNegativeExpr { '-' !unary expr }
  | MultiplyExpr { expr !multiplicative '*' expr }
  | DivideExpr { expr !multiplicative keyword<'div'> expr }
  | ModulusExpr { expr !multiplicative keyword<'mod'> expr }
  | AddExpr { expr !additive '+' expr }
  | SubtractExpr { expr !additive '-' expr }
  | GreaterThanExpr { expr !relational '>' expr }
  | GreaterEqualExpr { expr !relational '>=' expr }
  | LessThanExpr { expr !relational '<' expr }
  | LessEqualExpr { expr !relational '<=' expr }
  | NotEqualsExpr { expr !equality '!=' expr }
  | EqualsExpr { expr !equality '=' expr }
  | AndExpr { expr !and keyword<'and'> expr }
  | OrExpr { expr !or keyword<'or'> expr }
}

NameTest {
  name
  | wildcard
  | qualifiedWildcard
}

name {
  localName
  | qualifiedName
}

@skip {
  whitespace
}

@tokens {
  whitespace { $[ \r\n\t]+ }
  localNameStartChar {
    std.asciiLetter | "_"
    | $[\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C-\u200D]
    | $[\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\u{10000}-\u{EFFFF}]
  }
  localNameChar {
    localNameStartChar | "-" | "." | std.digit | $[\u00B7\u0300-\u036F\u203F-\u2040]
  }
  localName {
    localNameStartChar localNameChar*
  }
  qualifiedName {
    localName ':' localName
  }
  wildcard {
    '*'
  }
  qualifiedWildcard {
    localName ':' '*'
  }
  StringLiteral {
    '"' !["]* '"'
    | "'" ![']* "'"
  }
  NumberLiteral {
    @digit+ ('.' @digit*)?
    | '.' @digit+
  }
  VariableReference {
    '$' localName
    | '$' qualifiedName
  }
  @precedence {
    qualifiedWildcard qualifiedName localName
    NumberLiteral '.'
  }
}

I’ve simplified things a bit. For example, instead of trying to disambiguate between a function call expression and a node type test in a path step (as in true() vs. comment()), both cases are covered by Invoke, with no validation of the name or arguments of a node type test. There is also no check that an axis name is valid, so the grammar will happily accept //my-completely-made-up-axis::*.
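
Both checks could probably live outside the grammar, as a pass over the finished tree. Here is a minimal sketch of that idea, assuming the caller has already pulled the text of an AxisName or FunctionName node out of the document. The helper names (isKnownAxis, classifyInvoke) are hypothetical and not part of Lezer; the name lists are the thirteen axes and four node types that XPath 1.0 defines.

```typescript
// The thirteen axes defined by XPath 1.0.
const XPATH_AXES: ReadonlySet<string> = new Set([
  "ancestor", "ancestor-or-self", "attribute", "child", "descendant",
  "descendant-or-self", "following", "following-sibling", "namespace",
  "parent", "preceding", "preceding-sibling", "self",
]);

// The four node type tests defined by XPath 1.0.
const XPATH_NODE_TYPES: ReadonlySet<string> = new Set([
  "comment", "text", "processing-instruction", "node",
]);

// Hypothetical helper: is the text of an AxisName node a real XPath 1.0 axis?
function isKnownAxis(axisName: string): boolean {
  return XPATH_AXES.has(axisName);
}

// Hypothetical helper: decide what an Invoke node stands for,
// given the text of its FunctionName child.
function classifyInvoke(functionName: string): "nodeType" | "functionCall" {
  return XPATH_NODE_TYPES.has(functionName) ? "nodeType" : "functionCall";
}
```

With this, //my-completely-made-up-axis::* would still parse, but a linter built on the tree could flag the axis name, and could treat comment() differently from true() without the grammar needing two separate rules.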

Any and all feedback is welcome, especially if you can see a case where a valid path wouldn’t be parsed properly, but also any problems with how the grammar is written from an efficiency (or even stylistic) perspective.

Thanks!

marijn · August 10, 2022, 11:50am

I don’t know XPath well enough to analyze the details, but at a glance the grammar looks good!