How to match end of file in line oriented grammars?

UPDATE: This has been resolved, and the fixed grammar is in the repo linked below.

I’ve created a line-oriented grammar for a project I’m working on, where each line is a separate statement in the language. It all works except for the last line, which may not have a newline ("\n") because the input hits the end of file first. For example:

"foo baz\n" works vs "foo bar" which doesn’t because there is no final "\n"

I’ve looked at the Python grammar with its external tokenizer in tokens.js, but when I try to create my own version of it for my grammar it doesn’t work (and, depending on how I set the fallback option, it sometimes goes into an infinite loop and crashes the browser). I must not be advancing the token stream correctly, or I’ve configured it wrong.

To help debug the problem I have created a GitHub repo with two simple line-oriented grammars: one with just newlines, and one with an external tokenizer to match the end of file. The repo is here and can be run via npm install && npm test:

The two example grammars are:

@top NewLineExample { line+ }

line { Foo | emptyLine }
emptyLine { whitespace* newLine }
newLine { "\n" }

Foo { "foo" whitespace Var whitespace? newLine }
Var { identifier }

@tokens {
  singlespace { " " | "\t" }
  whitespace { singlespace+ }
  identifier { std.asciiLetter+ }
}

and

@top NewlineAndEOFExample { line+ }

line { Foo | emptyLine }
emptyLine { whitespace* newLineOrEOF }
newLineOrEOF { newline | eof }

Foo { "foo" whitespace Var whitespace? newLineOrEOF }
Var { identifier }

@tokens {
  singlespace { " " | "\t" }
  whitespace { singlespace+ }
  identifier { std.asciiLetter+ }
}

@external tokens newlines from "./newline-or-eof-example-tokens.js" { newline, eof }

with the external tokenizer file (newline-or-eof-example-tokens.js) like this:

import {ExternalTokenizer} from "lezer"
import {
  newline as newlineToken, eof
} from "./newline-or-eof-example.terms.js"

const newline = 10, carriageReturn = 13

export const newlines = new ExternalTokenizer((input, token, stack) => {
  let next = input.get(token.start)
  if (next < 0) {
    // Past the end of the input: emit a zero-length eof token.
    token.accept(eof, token.start)
  } else if (next === newline || next === carriageReturn) {
    // Emit a one-character newline token.
    token.accept(newlineToken, token.start + 1)
  }
}, {contextual: true, fallback: false})

The test script looks like this:

import { parser as newlineParser } from "./newline-example.js"
import { parser as newlineOrEofParser } from "./newline-or-eof-example.js"

console.log("works", newlineParser.parse("foo baz\n").toString())
console.log("works", newlineOrEofParser.parse("foo baz\n").toString())

console.log("fails", newlineParser.parse("foo baz").toString())
console.log("fails", newlineOrEofParser.parse("foo baz").toString())

and when run it outputs:

works NewLineExample(Foo(Var))
works NewlineAndEOFExample(Foo(Var))
fails NewLineExample(Foo(Var,⚠))
fails NewlineAndEOFExample(Foo(Var,⚠))

I’d appreciate any help with fixing the external tokenizer, or with matching the end of file without an external tokenizer.

Working with Lezer and all the updates in CodeMirror 6 has been fantastic. Thanks for all the work!

It seems like the infinite loop you get when you put the external tokenizer before the regular tokens is caused by an infinite number of empty lines being matched at the end of the input (which is reasonable: whitespace* newLineOrEOF will match when there’s no further input, and your tokenizer can keep producing eof tokens). That could be worked around by having separate rules for trailing empty lines and regular empty lines, and doing something like (line | emptyLine)* trailingEmptyLine?.

I haven’t yet figured out why your external tokenizer isn’t being called at all in the grammar as you’ve shown it. Looking into that.

Following up on why the tokenizer wasn’t being called when declared below the main tokens: with fallback=false, a tokenizer won’t run if any previous tokenizer produces actions, and the main tokenizer was producing the actions for eof here. So that behaviour seems reasonable.
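
To make that concrete: the flag in question is the second argument to ExternalTokenizer in your tokenizer file. A rough, untested sketch of the same tokenizer with only that flag flipped would look like:

import {ExternalTokenizer} from "lezer"
import {
  newline as newlineToken, eof
} from "./newline-or-eof-example.terms.js"

const newline = 10, carriageReturn = 13

// With fallback set to true, this tokenizer is still tried even though it
// is declared below the main tokens. With the grammar as it currently
// stands, that presumably just trades "never called" for the empty-line
// loop described above (the zero-length eof token can be produced over and
// over); it should only become safe once the grammar stops accepting an
// unbounded number of empty lines at the end of the input.
export const newlines = new ExternalTokenizer((input, token, stack) => {
  let next = input.get(token.start)
  if (next < 0) {
    token.accept(eof, token.start)
  } else if (next === newline || next === carriageReturn) {
    token.accept(newlineToken, token.start + 1)
  }
}, {contextual: true, fallback: true})

Either way, restructuring the grammar as suggested above is probably the cleaner fix, since it removes the source of the loop entirely.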

Thanks for the quick feedback! I set fallback to false in the repo so it wouldn’t go into an infinite loop when the test was run.

I’ll update the grammar tonight when I work on it again (it’s a personal project) and test a production like the (line | emptyLine)* trailingEmptyLine? you mentioned.

I had a couple of minutes free, so I was able to update the grammar with your suggested fix, and the test works now:

@top FixedExample { (lineWithNewline | emptyLine)* lineWithoutNewline? }

lineWithNewline { Foo newLine }
lineWithoutNewline { Foo }
emptyLine { whitespace* newLine }
newLine { "\n" }

Foo { "foo" whitespace Var whitespace? }
Var { identifier }

@tokens {
  singlespace { " " | "\t" }
  whitespace { singlespace+ }
  identifier { std.asciiLetter+ }
}

with the test file as:

console.log("\n*** FIXED PARSER EXAMPLES:")
console.log("works", fixedParser.parse("").toString())
console.log("works", fixedParser.parse("\n").toString())
console.log("works", fixedParser.parse("\n\n").toString())
console.log("works", fixedParser.parse("foo baz").toString())
console.log("works", fixedParser.parse("foo baz\n").toString())
console.log("works", fixedParser.parse("foo baz\nfoo bar").toString())
console.log("works", fixedParser.parse("foo baz\nfoo bar\n\n").toString())

with the output being:

*** FIXED PARSER EXAMPLES:
works FixedExample
works FixedExample
works FixedExample
works FixedExample(Foo(Var))
works FixedExample(Foo(Var))
works FixedExample(Foo(Var),Foo(Var))
works FixedExample(Foo(Var),Foo(Var))

I’ve updated the GitHub repo with the fixed grammar in case this is useful for others in the future.

It has been 29 years since my compilers class in undergrad, so my tools are a little rusty. I got sidetracked into the external tokenizer after looking at the Python grammar, instead of remembering how to match ε in a production. Thanks again for pointing me in the right direction!