Context-sensitive whitespace

I have a (minimized) grammar that parses simple statement expressions. The binary operators are + and *; adjacent expressions are otherwise interpreted as an implicit *. In the presented grammar, the basic terms are just identifiers and parenthesized expressions.

I’d like to be able to parse statements that include a line separator; e.g. this is one statement:

x + 
y +
z

But I don't want to require any other statement separator; e.g. the following lines are three statements:

x + x
x + y
y + y

So I think I want to implement a zero-length token that disallows \n. I called it nbsp:

BinaryExpression2 {
    term2 !implicitMultiply nbsp term2
}

I thought using an explicit zero-length external token might work, but I may have misunderstood how external tokenizers work.

The docs say that what I want should be possible, but I couldn't work out how:

Even white space, the type of tokens implicitly skipped by the parser, is contextual, and you can have different rules skip different things.

I'm definitely missing something and/or doing something wrong; I'm hoping it'll be easy to spot.

My minimized and simplified grammar and ExternalTokenizer:

@precedence {
  implicitMultiply @left,
  multiply @left,
  plus @left
}

@top Start {
  statements
}

statements {
  topLevelStatement |
  Block
}

Block {
  topLevelStatement (semi topLevelStatement)+
}

topLevelStatement[@isGroup="Statement"] {
  "" |
  ExpressionStatement { term1 }
}

term1[@isGroup="Expression"] {
  BinaryExpression1 |
  BinaryExpression2 |
  term2
}

BinaryExpression1 {
  term1 !multiply op<"*"> term1 |
  term1 !plus op<"+"> term1
}

BinaryExpression2 {
  term2 (!implicitMultiply nbsp term2)+
}

term2[@isGroup="Expression"] {
  Symbol |
  Parentheses
}

Parentheses {
  "(" term1 ")"
}

Symbol {
  identifier
}

semi { ";" | insertSemi }

@skip { whitespace }

op[@name="Operator"]<content> {
  content
}

@tokens {
  whitespace { std.whitespace+ }
  identifierChar { std.asciiLetter | $[_$\u{a1}-\u{10ffff}] }
  identifier { identifierChar (identifierChar | std.digit)* }
  @precedence { identifier, whitespace }
}

@external tokens insertSemicolon from "./tokens" { insertSemi }
@external tokens nonBreakingSpace from "./tokens" { nbsp }

In ./tokens:

export const nonBreakingSpace = new ExternalTokenizer(
  (input, token, stack) => {
    const next = input.get(token.start)
    const zeroLengthToken = token.start >= token.end
    const hasTokensBeforeNewLine = tokensBeforeNewline(input, token.start)
    const isNbsp = zeroLengthToken && hasTokensBeforeNewLine
    if ((next === -1 || isNbsp) && stack.canShift(nbsp)) {
      token.accept(nbsp, token.start)
    }
  },
  { contextual: true, fallback: true, extend: false }
)

function tokensBeforeNewline(input: Input, pos: number): boolean {
  const eol = input.lineAfter(pos)
  return !!eol.trim()
}

Any help would be gratefully received. Thanks.

I’ve also tried adding a skip expression, with a whitespace token not containing \n.

@skip {
  noBreakWhitespace
} {
  BinaryExpression2 {
    term2 (!implicitMultiply term2)+
  }
}

This gives an error “Inconsistent skip sets after term2”.

The parser needs to know what to skip after a term2, and when that could be either the regular top-level skip set or only spaces, it can't tell which.

This might be best done with an external tokenizer that produces an implicitMultiply token when it sees spaces followed by something that might start a term after the current position. (Put it above your @tokens block so that it runs before the whitespace token is matched.)

Thanks for replying and for your help. I tried your suggestions and couldn't get them to work, but I think I'm beginning to understand why.

As I understand it, the skip-expression approach isn't going to work, because skip expressions don't work for non-terminals, e.g.:

@skip { noNewLines } {
    BinaryExpression2 {
         term2 (!implicitMultiply term2)
    }
}

term2 { Identifier }

@skip { whitespace }

considering the input:

 x
y

A whitespace anywhere in that expression may belong to the skip set of term2 or the skip set of BinaryExpression2. Lezer can't tell whether we've descended into term2 or are still in BinaryExpression2, and so it errors when the parser is built.

This is counter-intuitive but, from a Lezer's-eye view, understandable. I don't know what Lezer should do in the general case, or how to persuade it to do what I want in this specific one.

For the external tokens approach (placing it above the @tokens block):

@external tokens nonBreakingSpace from "./tokens" { nbsp }

BinaryExpression2 {
    term2 nbsp term2
}

@tokens {
    …
    whitespace { std.whitespace+ }
    …
}
@skip { whitespace }

Here it's easier to understand why it doesn't work, considering the same two-line block:

x
y

I want these to be two expressions. But the newline between the x and the y is skipped as part of std.whitespace, between the x and the nbsp at the immediate start of the second line, so it gets parsed as a BinaryExpression2 instead of two ExpressionStatements.

My next approach would be to remove newlines from the skip set and annotate the grammar at the places where newlines are allowed, rather than where they are not. This feels like a last resort, and I expect it will end up with a lot of conflicts, so I'd like to avoid it if possible.

What am I missing?

Firstly, I think non-breaking space is not a great name for this (it already has a different meaning).

But also, my idea was for the external token to explicitly indicate that an implicit multiply is allowed at that point: a tokenizer that returns an implicitMultiply token when it sees one or more spaces followed by a letter (or digit, etc.). The contextual tokenization will make sure that it only runs in places where that token may appear (after a term), so this should be relatively efficient. It could return a token covering the whitespace, which you'd use instead of nbsp in your rule for multiplication. Does that make sense?
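For concreteness, here is a rough sketch of what that tokenizer might look like, written against the same ExternalTokenizer API as the code above. The import paths, the implicitMult term name, and the set of characters treated as able to start a term are assumptions to adapt to your setup; the grammar would declare the tokenizer above the @tokens block (e.g. @external tokens implicitMultiply from "./tokens" { implicitMult }) and use implicitMult in place of nbsp in the multiplication rule, something like term2 (!implicitMultiply implicitMult term2)+.

import { ExternalTokenizer } from "lezer"
// Assumed: implicitMult is the term id generated for the external token.
import { implicitMult } from "./parser.terms"

const space = 32, tab = 9, newline = 10, carriageReturn = 13
const underscore = 95, dollar = 36, parenOpen = 40

// Characters that can start a term in this grammar: "(" or an identifier start.
function startsTerm(ch: number): boolean {
  return ch == parenOpen || ch == underscore || ch == dollar ||
    (ch >= 65 && ch <= 90) || (ch >= 97 && ch <= 122) || ch >= 0xa1
}

export const implicitMultiply = new ExternalTokenizer(
  (input, token, stack) => {
    // Only bother in states where an implicit multiplication may follow.
    if (!stack.canShift(implicitMult)) return
    let pos = token.start
    // Read spaces and tabs, but stop at a newline: a line break between
    // terms should end the statement rather than multiply.
    while (input.get(pos) == space || input.get(pos) == tab) pos++
    if (pos == token.start) return
    const next = input.get(pos)
    if (next == newline || next == carriageReturn || next == -1) return
    // Emit a token covering the whitespace when something that can start
    // a term comes next.
    if (startsTerm(next)) token.accept(implicitMult, pos)
  },
  { contextual: true }
)

Since the token is never produced across a newline, whitespace that contains a line break still falls through to the regular skip rule, so x on one line and y on the next should remain two separate statements.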