How to match end of file in line oriented grammars?

UPDATE: This has been resolved, and the fixed grammar is in the repo linked below.

I’ve created a line-oriented grammar for a project I’m working on, where each line is a separate statement in the language. It all works except for the last line, which may not have a newline ("\n") because the input hits the end of file first. For example:

"foo baz\n" works vs "foo bar" which doesn’t because there is no final "\n"

I’ve looked at the Python grammar with its external tokenizer in tokens.js, but when I try to create my own version of it for my grammar it doesn’t work (and, depending on how I set the fallback option, it sometimes goes into an infinite loop and crashes the browser). I must not be advancing the token stream correctly, or I’ve configured it wrong.

To help debug the problem I have created a GitHub repo with two simple line-oriented grammars: one with just newlines, and one with an external tokenizer to match the end of file. The repo is here and can be run via npm install && npm test:

The two example grammars are:

@top NewLineExample { line+ }

line { Foo | emptyLine }
emptyLine { whitespace* newLine }
newLine { "\n" }

Foo { "foo" whitespace Var whitespace? newLine }
Var { identifier }

@tokens {
  singlespace { " " | "\t" }
  whitespace { singlespace+ }
  identifier { std.asciiLetter+ }
}

and

@top NewlineAndEOFExample { line+ }

line { Foo | emptyLine }
emptyLine { whitespace* newLineOrEOF }
newLineOrEOF { newline | eof }

Foo { "foo" whitespace Var whitespace? newLineOrEOF }
Var { identifier }

@tokens {
  singlespace { " " | "\t" }
  whitespace { singlespace+ }
  identifier { std.asciiLetter+ }
}

@external tokens newlines from "./newline-or-eof-example-tokens.js" { newline, eof }

with the external tokenizer file (newline-or-eof-example-tokens.js) like this:

import {ExternalTokenizer} from "lezer"
import {
  newline as newlineToken, eof
} from "./newline-or-eof-example.terms.js"

const newline = 10, carriageReturn = 13

export const newlines = new ExternalTokenizer((input, token, stack) => {
  let next = input.get(token.start)
  if (next < 0) {
    // Past the end of the input: emit a zero-length eof token.
    token.accept(eof, token.start)
  } else if (next === newline || next === carriageReturn) {
    // Emit a one-character newline token.
    token.accept(newlineToken, token.start + 1)
  }
}, {contextual: true, fallback: false})

The test script looks like this:

import { parser as newlineParser } from "./newline-example.js"
import { parser as newlineOrEofParser } from "./newline-or-eof-example.js"

console.log("works", newlineParser.parse("foo baz\n").toString())
console.log("works", newlineOrEofParser.parse("foo baz\n").toString())

console.log("fails", newlineParser.parse("foo baz").toString())
console.log("fails", newlineOrEofParser.parse("foo baz").toString())

and when run it outputs:

works NewLineExample(Foo(Var))
works NewlineAndEOFExample(Foo(Var))
fails NewLineExample(Foo(Var,⚠))
fails NewlineAndEOFExample(Foo(Var,⚠))

I’d appreciate any help with fixing the external tokenizer, or with matching the end of file without an external tokenizer.

Working with Lezer and all the updates in CodeMirror 6 has been fantastic. Thanks for all the work!

It seems like the infinite loop you get when you put the external tokenizer before the regular tokens is caused by an infinite number of empty lines being matched at the end of the input (which is reasonable: whitespace* newLineOrEOF will match when there’s no further input, and your tokenizer can keep producing eof tokens). That could be worked around by having separate rules for trailing empty lines and regular empty lines, and doing something like (line | emptyLine)* trailingEmptyLine?.

I haven’t yet figured out why your external tokenizer isn’t being called at all in the grammar as you’ve shown it. Looking into that.

Following up on why the tokenizer wasn’t being called when declared below the main tokens: with fallback=false, a tokenizer won’t run if any previous tokenizer produces actions, and the main tokenizer was producing the actions for eof here. So that behaviour seems reasonable.
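
To make that concrete: the flag in question is the second argument to ExternalTokenizer in your tokenizer file. A rough, untested sketch of the same tokenizer with only that flag flipped would look like:

import {ExternalTokenizer} from "lezer"
import {
  newline as newlineToken, eof
} from "./newline-or-eof-example.terms.js"

const newline = 10, carriageReturn = 13

// With fallback set to true, this tokenizer is still tried even though it
// is declared below the main tokens. With the grammar as it currently
// stands, that presumably just trades "never called" for the empty-line
// loop described above (the zero-length eof token can be produced over and
// over); it should only become safe once the grammar stops accepting an
// unbounded number of empty lines at the end of the input.
export const newlines = new ExternalTokenizer((input, token, stack) => {
  let next = input.get(token.start)
  if (next < 0) {
    token.accept(eof, token.start)
  } else if (next === newline || next === carriageReturn) {
    token.accept(newlineToken, token.start + 1)
  }
}, {contextual: true, fallback: true})

Either way, restructuring the grammar as suggested above is probably the cleaner fix, since it removes the source of the loop entirely.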

Thanks for the quick feedback! I set fallback to false in the repo so it wouldn’t go into an infinite loop when the test was run.

I’ll update the grammar tonight when I work on it again (it’s a personal project) and test a production like the (line | emptyLine)* trailingEmptyLine? you mentioned.

I had a couple of minutes free, so I was able to update the grammar with your suggested fix, and the test works now:

@top FixedExample { (lineWithNewline | emptyLine)* lineWithoutNewline? }

lineWithNewline { Foo newLine }
lineWithoutNewline { Foo }
emptyLine { whitespace* newLine }
newLine { "\n" }

Foo { "foo" whitespace Var whitespace? }
Var { identifier }

@tokens {
  singlespace { " " | "\t" }
  whitespace { singlespace+ }
  identifier { std.asciiLetter+ }
}

with the test file as:

console.log("\n*** FIXED PARSER EXAMPLES:")
console.log("works", fixedParser.parse("").toString())
console.log("works", fixedParser.parse("\n").toString())
console.log("works", fixedParser.parse("\n\n").toString())
console.log("works", fixedParser.parse("foo baz").toString())
console.log("works", fixedParser.parse("foo baz\n").toString())
console.log("works", fixedParser.parse("foo baz\nfoo bar").toString())
console.log("works", fixedParser.parse("foo baz\nfoo bar\n\n").toString())

with the output being:

*** FIXED PARSER EXAMPLES:
works FixedExample
works FixedExample
works FixedExample
works FixedExample(Foo(Var))
works FixedExample(Foo(Var))
works FixedExample(Foo(Var),Foo(Var))
works FixedExample(Foo(Var),Foo(Var))

I’ve updated the GitHub repo with the fixed grammar in case this is useful for others in the future.

It has been 29 years since my compilers class in undergrad, so my tools are a little rusty. I got sidetracked into the external tokenizer after looking at the Python grammar, instead of remembering how to match ε in a production. Thanks again for pointing me in the right direction!