I have a (minimized) grammar that parses simple statement expressions. The binary operators are + and *. Adjacent expressions are otherwise interpreted as an implicit *. In the presented grammar, the terminals are just identifiers and parenthesized expressions.
I’d like to be able to parse statements that include a line separator; e.g. this is one statement:
x +
y +
z
But I don’t want to require any other statement separator; e.g. the following lines are three statements:
x + x
x + y
y + y
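To make the intended behaviour concrete, here is a rough standalone sketch of the line-grouping rule I have in mind. It is not Lezer code, and splitStatements and BINARY_OPERATORS are hypothetical names used only for illustration. It assumes a statement continues only when a line ends with a binary operator, and it ignores implicit multiplication and parentheses entirely.

```typescript
// Hypothetical helper (not part of Lezer): groups source lines into statements.
// Assumption: a line that ends with a binary operator continues on the next line.
const BINARY_OPERATORS = ["+", "*"];

function splitStatements(source: string): string[] {
  const statements: string[] = [];
  let pending: string[] = [];
  for (const rawLine of source.split("\n")) {
    const line = rawLine.trim();
    if (line === "") continue; // blank lines carry no statement
    pending.push(line);
    const endsWithOperator = BINARY_OPERATORS.some((op) => line.endsWith(op));
    if (!endsWithOperator) {
      // The line is complete, so everything collected so far is one statement.
      statements.push(pending.join(" "));
      pending = [];
    }
  }
  // A trailing operator with nothing after it is kept as an incomplete statement.
  if (pending.length > 0) statements.push(pending.join(" "));
  return statements;
}
```

With this rule, the first example above comes out as a single statement and the second as three separate statements.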
So I think I want to implement a zero-length token that disallows \n. I called it nbsp:
BinaryExpression2 {
  term2 !implicitMultiply nbsp term2
}
I thought using an @external zero-length token might work, but I think I may have misunderstood how external tokenizers work.
The docs suggest that what I want to do is possible, but I couldn’t work out how:
"Even white space, the type of tokens implicitly skipped by the parser, is contextual, and you can have different rules skip different things."
I’m definitely missing something and/or doing something wrong; I’m hoping it’ll be easy to spot.
My minimized and simplified grammar and ExternalTokenizer:
@precedence {
  implicitMultiply @left,
  multiply @left,
  plus @left
}

@top Start {
  statements
}

statements {
  topLevelStatement |
  Block
}

Block {
  topLevelStatement (semi topLevelStatement)+
}

topLevelStatement[@isGroup="Statement"] {
  "" |
  ExpressionStatement { term1 }
}

term1[@isGroup="Expression"] {
  BinaryExpression1 |
  BinaryExpression2 |
  term2
}

BinaryExpression1 {
  term1 !multiply op<"*"> term1 |
  term1 !plus op<"+"> term1
}

BinaryExpression2 {
  term2 (!implicitMultiply nbsp term2)+
}

term2[@isGroup="Expression"] {
  Symbol |
  Parentheses
}

Parentheses {
  "(" term1 ")"
}

Symbol {
  identifier
}

semi { ";" | insertSemi }

@skip { whitespace }

op[@name="Operator"]<content> {
  content
}

@tokens {
  whitespace { std.whitespace+ }
  identifierChar { std.asciiLetter | $[_$\u{a1}-\u{10ffff}] }
  identifier { identifierChar (identifierChar | std.digit)* }
  @precedence { identifier, whitespace }
}

@external tokens insertSemicolon from "./tokens" { insertSemi }
@external tokens nonBreakingSpace from "./tokens" { nbsp }
In ./tokens:
export const nonBreakingSpace = new ExternalTokenizer(
  (input, token, stack) => {
    const next = input.get(token.start)
    const zeroLengthToken = token.start >= token.end
    const hasTokensBeforeNewLine = tokensBeforeNewline(input, token.start)
    const isNbsp = zeroLengthToken && hasTokensBeforeNewLine
    if ((next === -1 || isNbsp) && stack.canShift(nbsp)) {
      token.accept(nbsp, token.start)
    }
  },
  { contextual: true, fallback: true, extend: false }
)

function tokensBeforeNewline(input: Input, pos: number): boolean {
  const eol = input.lineAfter(pos)
  return !!eol.trim()
}
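
To spell out what I intend tokensBeforeNewline to check, here is a standalone sketch that works on a plain string instead of the Lezer input object. hasTokenBeforeNewline is a hypothetical name used only for illustration, and it assumes lineAfter(pos) returns the rest of the current line after pos, which is my reading of it.

```typescript
// Hypothetical stand-in for tokensBeforeNewline, using a plain string
// rather than a Lezer input object.
function hasTokenBeforeNewline(text: string, pos: number): boolean {
  const newline = text.indexOf("\n", pos);
  const end = newline === -1 ? text.length : newline;
  // True when anything other than whitespace sits between pos and the line end.
  return text.slice(pos, end).trim().length > 0;
}
```

For the text "x y\nz" at position 1 (just after x) this is true, so an implicit multiply with y should be allowed. For "x \ny" at position 1 it is false, so y should start a new statement.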
Any help would be gratefully received. Thanks.