external tokenizer: acceptToken is not adding token at end of string

i have wasted several hours now on this trivial bug …
i hope someone (marijn?) can help me fix this in a few minutes
(how can i “buy you a coffee”?)

i want to parse “string blocks” in the nix language

''
  stringcontent
''

i have based my parser on the javascript parser for template strings

the parser works as long as there is a newline at the end of the string

when i parse

''stringcontent''

then the stringcontent is missing in the parse tree

reproduce

git clone https://github.com/milahu/lezer-nix --depth 1 \
  --branch repro-bug-stringblock-missing-end-of-string
cd lezer-nix
npm install
npm run build
npm run test # fail: string block with interpolation single line
node test/manual-test.js "$(printf "''x\n''")" # pass
node test/manual-test.js "$(printf "''x''")" # fail

debug output

node test/manual-test.js "$(printf "''x\n''")"
i = 0 + next = 120 = "x" + afterQuote = false + afterDollar = false
i = 1 + next = 10 = "\n" + afterQuote = false + afterDollar = false
  acceptToken(StringBlockContent) from newline
  break 73
i = 0 + next = 39 = "'" + afterQuote = false + afterDollar = false
  found singlequote 1
i = 1 + next = 39 = "'" + afterQuote = true + afterDollar = false
  found singlequote 2
  acceptToken(stringBlockEnd) with empty string
  break 50
Nix
  StringBlock "''x\n''"
    StringBlockContent "x\n"
node test/manual-test.js "$(printf "''x''")"
i = 0 + next = 120 = "x" + afterQuote = false + afterDollar = false
i = 1 + next = 39 = "'" + afterQuote = false + afterDollar = false
  found singlequote 1
i = 2 + next = 39 = "'" + afterQuote = true + afterDollar = false
  found singlequote 2
  acceptToken(StringBlockContent, -2)
  acceptToken(stringBlockEnd)
  break 50
Nix
  StringBlock ''x''

in the second case, the StringBlockContent is missing

more debug output

when i add this to node_modules/@lezer/lr/dist/index.js

console.log(`@lezer/lr/dist/index.js 575 this.token = ${JSON.stringify(this.token)}`) // debug

then i see: in the broken case (''x''), the tokens are overlapping

node test/manual-test.js "$(printf "''x\n''")"
...
  71 acceptToken(StringBlockContent) from newline
@lezer/lr/dist/index.js 575 this.token:
{"start":2,"value":1,"end":4,"extended":-1,"lookAhead":4,"mask":0,"context":0}
...
  40 acceptToken(stringBlockEnd) with empty string
@lezer/lr/dist/index.js 575 this.token:
{"start":4,"value":28,"end":6,"extended":-1,"lookAhead":5,"mask":1,"context":0}
node test/manual-test.js "$(printf "''x''")"
...
  44 acceptToken(StringBlockContent, -2)
@lezer/lr/dist/index.js 575 this.token:
{"start":2,"value":1,"end":3,"extended":-1,"lookAhead":4,"mask":0,"context":0}
  46 acceptToken(stringBlockEnd)
@lezer/lr/dist/index.js 575 this.token:
{"start":2,"value":28,"end":5,"extended":-1,"lookAhead":4,"mask":0,"context":0}

both tokens have "start":2

full output
node test/manual-test.js "$(printf "''x\n''")"
@lezer/lr/dist/index.js 575 this.token:
{"start":0,"value":38,"end":2,"extended":-1,"lookAhead":2,"mask":0,"context":0}
i = 0 + next = 120 = "x" + afterQuote = false + afterDollar = false
  80 advance -> continue loop
i = 1 + next = 10 = "\n" + afterQuote = false + afterDollar = false
  71 acceptToken(StringBlockContent) from newline
@lezer/lr/dist/index.js 575 this.token:
{"start":2,"value":1,"end":4,"extended":-1,"lookAhead":4,"mask":0,"context":0}
  73 break
i = 0 + next = 39 = "'" + afterQuote = false + afterDollar = false
  54 found singlequote 1
  80 advance -> continue loop
i = 1 + next = 39 = "'" + afterQuote = true + afterDollar = false
  36 found singlequote 2
  40 acceptToken(stringBlockEnd) with empty string
@lezer/lr/dist/index.js 575 this.token:
{"start":4,"value":28,"end":6,"extended":-1,"lookAhead":5,"mask":1,"context":0}
  50 break
Nix
  StringBlock "''x\n''"
    StringBlockContent "x\n"
node test/manual-test.js "$(printf "''x''")"
@lezer/lr/dist/index.js 575 this.token:
{"start":0,"value":38,"end":2,"extended":-1,"lookAhead":2,"mask":0,"context":0}
i = 0 + next = 120 = "x" + afterQuote = false + afterDollar = false
  80 advance -> continue loop
i = 1 + next = 39 = "'" + afterQuote = false + afterDollar = false
  54 found singlequote 1
  80 advance -> continue loop
i = 2 + next = 39 = "'" + afterQuote = true + afterDollar = false
  36 found singlequote 2
  44 acceptToken(StringBlockContent, -2)
@lezer/lr/dist/index.js 575 this.token:
{"start":2,"value":1,"end":3,"extended":-1,"lookAhead":4,"mask":0,"context":0}
  46 acceptToken(stringBlockEnd)
@lezer/lr/dist/index.js 575 this.token:
{"start":2,"value":28,"end":5,"extended":-1,"lookAhead":4,"mask":0,"context":0}
  50 break
Nix
  StringBlock ''x''
ContextTracker?

i guess i need a ContextTracker

  • parse end of string content with input.peek()
  • save state for next parse loop
  • break the current parse loop, so the next token has the right start

workaround: save state in the input object as input._todoStringBlockEnd
see 7e8ac69

Sorry, that’s too much gnarly code for me to take the time to figure out what’s going on. If you can distill this to a minimal setup (without the hack, which as you’re probably aware isn’t going to reliably work), I could take a look.

see my branch reduce-to-stingblock-parser

so far it's working …

i just needed to pass my parse state
to the next call of the stringBlock external tokenizer

as i understand it,
i must return from the external tokenizer function to start the next token;
in this case, that means breaking the for loop

to visualize

'
'
x cursor
' next
' peek1
  peek2 = EOF

at cursor, when i see two singlequotes in next (line 44) and peek1 (line 53), i want to

  1. accept token StringBlockContent = x (line 62)
  2. remember the private parse state for the next call (line 64)
  3. break = return externalParser, to start a new token (line 72)
  4. accept token StringBlockEnd = '' (line 45)
  5. clear the private parse state (line 50)
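the steps above can be sketched as a minimal standalone simulation. this uses plain string indexing instead of the real @lezer/lr ExternalTokenizer/InputStream, and the token names are assumed from the thread:

```javascript
// Minimal standalone sketch of the two-call scheme above. This simulates
// the tokenizer with plain string indexing instead of @lezer/lr's
// ExternalTokenizer and InputStream; token names are assumed from the thread.
const singlequote = 39; // "'"

// Scan one token starting at `pos` inside a string block.
// Returns a span like the one acceptToken would produce, or null at EOF.
function scanStringBlockToken(str, pos) {
  for (let i = 0; ; i++) {
    const next = str.charCodeAt(pos + i); // NaN past end of input
    if (Number.isNaN(next)) {
      return i > 0 ? { type: "StringBlockContent", from: pos, to: pos + i } : null;
    }
    if (next == singlequote && str.charCodeAt(pos + i + 1) == singlequote) {
      if (i == 0) {
        // cursor sits directly on '': emit the end token (step 4)
        return { type: "stringBlockEnd", from: pos, to: pos + 2 };
      }
      // content before '': accept it and return, so the next call
      // starts exactly at the quotes (steps 1 and 3); because the
      // quotes are detected by peeking, no stored state is needed here
      return { type: "StringBlockContent", from: pos, to: pos + i };
    }
  }
}

// scanning the body of ''x'' (the part after the opening quotes):
console.log(scanStringBlockToken("x''", 0)); // { type: "StringBlockContent", from: 0, to: 1 }
console.log(scanStringBlockToken("x''", 1)); // { type: "stringBlockEnd", from: 1, to: 3 }
```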

Why do you need to remember parse state in this case? There will still be two single quotes ahead when your tokenizer runs the next time, so it will already see that it is at the end of a string and emit the proper token. If I remove the hack your tests still pass.

(In case it isn’t clear—custom tokenizers are only run in states where at least one of the tokens they emit is valid, so your string block tokenizer, which only produces tokens used inside string blocks, won’t run at the start of a string block.)


because i don't call input.advance() before i break/return,
so in the next call, i starts at 0

yepp, the start token is parsed by lezer

thanks : )

Sure, but that’s not a problem, because you know that on '' you need to return a token even when i is zero, right?

yes, sorry i was too brief, i mean:

i can simply replace input._todoStringBlockEnd == true with i == 0
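a small standalone check of why the stored flag and the i == 0 test agree (plain-string simulation, names assumed from the thread):

```javascript
// Sketch: the stored flag and the stateless i == 0 check give the same
// answer, because when the previous call stopped right before '', the
// next call necessarily starts on it. Plain-string simulation; names
// are assumed from the thread.
const singlequote = 39; // "'"

function atStringBlockEnd(str, pos, i) {
  return str.charCodeAt(pos + i) == singlequote &&
         str.charCodeAt(pos + i + 1) == singlequote &&
         i == 0;
}

console.log(atStringBlockEnd("''", 0, 0));  // true  -- a fresh call starting on ''
console.log(atStringBlockEnd("x''", 0, 1)); // false -- '' seen mid-token: accept content first
```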

Yes, that should work.

i found a bug in my code

-      if (input.peek() == singlequote) {
+      if (input.peek(1) == singlequote) {

i was assuming that input.peek() returns the char after input.next,
but input.peek() always returns singlequote in my case

looks like a bug in lezer

the case offset == undefined is not handled

node
> undefined + 0
NaN
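a standalone demo of why a missing offset can make peek look stuck on one character, assuming an index-arithmetic implementation (not necessarily @lezer/lr's actual code):

```javascript
// Standalone demo, assuming an index-arithmetic peek (not necessarily
// @lezer/lr's actual implementation): `pos + undefined` is NaN, and
// String.prototype.charCodeAt coerces NaN to index 0, so peek() with
// no argument keeps reading the same character.
function peek(str, pos, offset) {
  return str.charCodeAt(pos + offset);
}

const chunk = "'x'";
console.log(peek(chunk, 0));    // 39  -- NaN index falls back to index 0: "'"
console.log(peek(chunk, 0, 1)); // 120 -- "x", the char actually at offset 1
```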

-peek(offset: number) {
+peek(offset: number = 0) {

?

There is no offset == undefined case, because the type of that parameter is number and it is not optional. Not passing it is a caller error.