Capturing multiline values in a lezer token

Use case

I am trying to write a simplified YAML grammar (yes, I understand that simple and YAML don’t really mix :smile:).

One element that I am finding tricky is dealing with multiline strings that essentially “capture” indented content:

# | marks start of the multiline string
multi: |
  line 1
  line 2
# this newline at a lower indent level ends the multiline string
simple: value
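
To make the intended behavior concrete, here is a minimal plain-JavaScript sketch of the capture rule, independent of Lezer. The helper name captureBlock and its return shape are hypothetical, and it deliberately ignores most of YAML (chomping indicators, tabs, comments, relative indentation inside the block). It only shows that the block ends at the first non-blank line indented no deeper than the key.

```javascript
// Hypothetical helper, not a Lezer API: collect the body of a block scalar
// whose key sits on line index `start` of `source`.
function captureBlock(source, start) {
  const lines = source.split("\n");
  const indentOf = (line) => line.length - line.trimStart().length;
  const baseIndent = indentOf(lines[start]);
  const body = [];
  let i = start + 1;
  for (; i < lines.length; i++) {
    const line = lines[i];
    if (line.trim() === "") {
      body.push(""); // blank lines do not end the block in this sketch
      continue;
    }
    if (indentOf(line) <= baseIndent) break; // dedent: the block is over
    body.push(line.trim()); // drop indentation and trailing spaces
  }
  return { body, next: i };
}

const source = "multi: |\n  line 1\n  line 2\nsimple: value";
const block = captureBlock(source, 0);
console.log(block.body, block.next);
```

Here `next` is the index of the first line after the block, which is where a dedent would be expected.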

What I’ve tried

I’ve gotten this to parse by using the tokenizer/context from the indent example, but there are parse errors and the dedent tokens aren’t actually working.

Here’s a stackblitz example to see live.

The problem

For this input:

key: value_one
multi: |
  line 1
  line 2
key_two: hi

It generates a parse tree like so:

Doc (key: value_one\nmulti: |\n  line 1\n  line 2\nkey_two: hi)
  Property (key: value_one\n)
    Key (key:)
    Value ( value_one)
  Property (multi: |\n  line 1\n  line 2\n)
    Key (multi:)
    MultiLineExp ( |\n  line 1\n  line 2)
      MultiLineKey ( |)
      ⚠ (\n)
      Value (line 1)
      ⚠ (\n  )
      Value (line 2)
      ⚠ ()
  Property (key_two: hi)
    Key (key_two:)
    Value ( h)
    ⚠ (i)

The tree has stray Value nodes, stray \n tokens, and error nodes. I really want everything from MultiLineKey to the next Property to be a single MultiLineExp (which I think is what the dedent is supposed to do?).

The issue seems to be Value and the _ with the following declaration:

element {
  Property { Key (Value | MultiLineExp) lineEnd } 
}

MultiLineExp { MultiLineKey indent Value* (dedent | eof) }

Value { (![|#"\n] _)+ }

It seems like _ is somehow confusing the external tokenizer and no dedent token is being emitted.

My Questions

  1. Is there a more natural way of expressing this multiline behavior that isn’t as reliant on the external tokenizer?
  2. If a tokenizer is the only way forward, what is the best way to debug issues with the logic? The interplay of grammar and tokenizer is difficult for me to understand :smile:

More context

Here is a more extended version of my grammar (see the stackblitz for the whole working thing):

// simplified, see demo for full example
// https://stackblitz.com/edit/js-az5b2u?file=index.js%3AL5
@top Doc { element* }

element {
  Property { Key (Value | MultiLineExp) lineEnd } 
}

MultiLineExp { MultiLineKey indent Value* (dedent | eof) }

lineEnd { newline | eof }

@context trackIndent from "./tokens.js"
@external tokens Indentation from "./tokens.js" {
  indent
  dedent
  blankLineStart
}

@tokens {
  // ...
  Key { (@asciiLetter | "_")* ":" }
  Value { (![|#"\n] _)+ }
  // ...
}

You aren’t accounting for the newlines in your parse rules (i.e. it looks like it should be MultiLineExp { MultiLineKey newline indent (Value (newline | eof))* (dedent | eof) }). Also, Value weirdly matches an additional arbitrary character after every character in the matched set.
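
The pairwise behavior of Value can be reproduced outside Lezer. As a rough analogy (an assumption: it ignores token precedence, skipped whitespace, and the contextual tokenizer), (![|#"\n] _)+ behaves like the JavaScript regex below: one character outside the set, then any character at all, repeated. The names pairwiseValue and matchAt are made up for this sketch.

```javascript
// Regex analogue of the Lezer token  Value { (![|#"\n] _)+ }
// ![|#"\n] = one character NOT in the set; _ = any single character.
const pairwiseValue = /^(?:[^|#"\n][\s\S])+/;

// Return the text the analogue would match starting at `pos`, or null.
function matchAt(re, input, pos) {
  const m = re.exec(input.slice(pos));
  return m ? m[0] : null;
}

const doc = "key: value_one\nmulti: |\n  line 1\n  line 2\nkey_two: hi";

// Right after "key:": " value_one" is 10 characters, so five complete pairs.
const first = matchAt(pairwiseValue, doc, "key:".length);

// Right after "key_two:": " h" is one pair, the trailing "i" has no partner.
const lastStart = doc.indexOf("key_two:") + "key_two:".length;
const last = matchAt(pairwiseValue, doc, lastStart);

console.log(JSON.stringify(first)); // " value_one"
console.log(JSON.stringify(last));  // " h"
```

The second result lines up with the Value ( h) followed by ⚠ (i) at the end of the tree in the original post.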

And your hard-coded term IDs aren’t accurate. @lezer/generator 1.4.0 adds support for passing a function as contextTracker, so that you don’t have to hard-code things like that.

And your hard-coded term IDs aren’t accurate. @lezer/generator 1.4.0 adds support for passing a function as contextTracker, so that you don’t have to hard-code things like that.

Wonderful! I was wondering why contextTracker didn’t allow for this.

You aren’t accounting for the newlines in your parse rules

Ah, thank you! That makes sense; I’ve implemented your suggestion. That has resolved the dedent detection issue.

However…

The plot thickens: even number failures

I’ve noticed that an even number of characters in my Value token leads to strange behavior:

key: 12
multi: |
  EVEN VALUES BROKEN

Leads to

Doc (key: 12\nmulti: |\n  EVEN VALUES BROKEN)
  Property (key: 12\nmulti: |\n)
    Key (key:)
    Value ( 12\nmulti: |\n)
  ⚠ (  EVEN)
  Property (EVEN VALUES BROKEN)
    ⚠ ()
    Value (EVEN VALUES BROKEN)

Whereas if I put

key: 1
multi: |
  ODD VALUES WORKING

I get:

Doc (key: 1\nmulti: |\n  ODD VALUES WORKING)
  Property (key: 1\n)
    Key (key:)
    Value ( 1\n)
  Property (multi: |\n  ODD VALUES WORKING)
    Key (multi:)
    MultiLineExp ( |\n  ODD VALUES WORKING)
      MultiLineKey ( |)
      MultiLineValue (ODD VALUES WORKING)

(a proper MultiLine expression).

A selection of my updated grammar (with your suggestions applied) is here (full sample here):

@top Doc { element* }

element {
  Property { Key " "? ((Value (newline | eof)) | MultiLineExp) } 
}

MultiLineExp { 
  MultiLineKey newline indent (MultiLineValue (newline | eof))* (dedent | eof) 
}

@tokens {
  // ...
  MultiLineValue { (![|#"\n]_)+ }
  Value { (![|#"\n] _)+ }
  @precedence {MultiLineValue, Comment, Value}
  @precedence {MultiLineKey, Value}
}

This happens both with stacked Values and with a Value followed by a MultiLineValue, so I don’t think it’s related to the indentation tokenizer. There seems to be something wrong with using _ (and maybe that is the root of the weird character capture issue you mentioned)?

Changing to

Value { (@asciiLetter | @digit | "_")+ }

Seems to resolve things, but unfortunately my input set is much wider than this (basically any character besides \n).

Update: seems to be related to the interplay between Key and Value actually…

// Key " "? means odd number values break
// Key " " means even number values break
Property { Key " " ((Value (newline | eof)) | MultiLineExp) }
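
One way to see why the parity flips is to look at where the pairwise token starts. In a regex analogy for (![|#"\n] _)+ (an approximation that does not model how Lezer chooses between the " " literal and Value, so it does not say which grammar variant lands in which case), starting on the space versus after it changes which values pair up cleanly. The helper names are hypothetical.

```javascript
// Regex analogue of  (![|#"\n] _)+ : characters are consumed two at a time.
const pairwise = /^(?:[^|#"\n][\s\S])+/;
const valueFrom = (input, pos) => (pairwise.exec(input.slice(pos)) || [null])[0];

const even = "key: 12\nmulti: |\n  EVEN";
const odd = "key: 1\nmulti: |\n  ODD";

// Leading space is part of the token (start right after "key:").
const withSpace = { even: valueFrom(even, 4), odd: valueFrom(odd, 4) };

// Leading space consumed separately (start right after "key: ").
const afterSpace = { even: valueFrom(even, 5), odd: valueFrom(odd, 5) };

console.log(withSpace);  // even runs on into the next line, odd stops at " 1"
console.log(afterSpace); // even stops at "12", odd runs on into the next line
```

Whenever the matched text would end on an unpaired character, that character pairs with the newline instead and the token runs on into the following line, which is the kind of run-on Value seen in the broken tree.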

Look at your token definition:

I’m not sure what you are trying to encode, but it looks really wrong.

looks really wrong

Yes :smile:

What I’m trying to express is the “plain” YAML type:

plain: plain value {}

# also valid but potentially easier to capture
string: "string value {}"
multiline_key: |
  multi line
  value
number: 1234

Where plain is essentially a sequence of @asciiLetter and any character other than \n or # until the end of the line. That’s obviously not quite working, though. In regex, I’d just do something like \w+: (\w[^\n]+), but that doesn’t seem to be how this capture group is working. It feels like I’m doing something fundamentally wrong, but I don’t understand what it is :sweat_smile:

Is there a way to do this with a plain token expression? Or do I need to reach for an external tokenizer, with something like a plainString token that can match this sort of text?

Hmm, reworking things a little:

Plain { (@asciiLetter|@digit) ![\n#]* }

This seems to help. Apparently the _ was the culprit.

@top Doc { element* }

element {
  Property { Key " "? ((scalar (newline | eof)) | MultiLineExp) } 
}

scalar {
  Plain | Number
}

// ...

@tokens {
  // ...
  Number { @digit+ }
  Plain { (@asciiLetter|@digit) ![\n#]* }
  @precedence {Number, Plain}
}

I’ll continue with this approach unless there’s a better, more lezer-y way to do it :smile_cat:
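
For comparison with the earlier Value token, here is the same kind of regex analogy for Plain. It is only a sketch under the assumption that @asciiLetter is [A-Za-z] and @digit is [0-9]; it does not model the @precedence between Number and Plain, and plainFrom is a made-up helper name.

```javascript
// Regex analogue of  Plain { (@asciiLetter | @digit) ![\n#]* }
// One constrained first character, then any run of characters except \n and #.
const plain = /^[A-Za-z0-9][^\n#]*/;
const plainFrom = (input, pos) => (plain.exec(input.slice(pos)) || [null])[0];

const sample = "plain: plain value {}\nnumber: 1234\nn: 1";

const text = plainFrom(sample, "plain: ".length);
const evenDigits = plainFrom(sample, sample.indexOf("1234"));
const oddDigits = plainFrom(sample, sample.length - 1);

console.log(JSON.stringify(text));       // "plain value {}"
console.log(JSON.stringify(evenDigits)); // "1234"
console.log(JSON.stringify(oddDigits));  // "1"
```

Because only the first character is constrained and the rest is a simple repetition, the match length no longer depends on parity and the match never crosses a newline.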