Capturing multiline values in a lezer token

NickTomlin · August 10, 2023, 10:26pm

Use case

I am trying to write a simplified YAML grammar (yes, I understand that simple and YAML don’t really mix).

One element that I am finding tricky is dealing with multiline strings that essentially “capture” indented content:

# | marks start of the multiline string
multi: |
  line 1
  line 2
# this newline at a lower indent level ends the multiline string
simple: value
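
For reference, the capture rule described above can be sketched in plain JavaScript, outside of Lezer entirely. This only illustrates the desired behavior and is not how the grammar implements it. `captureBlock` is a hypothetical helper, and it deliberately does not special-case blank lines or comments.

```javascript
// Hypothetical helper: collect the lines that belong to a block scalar.
// `lines` is the document split on "\n"; `start` is the index of the "key: |" line.
function captureBlock(lines, start) {
  // Number of leading whitespace characters on a line.
  const indentOf = (line) => line.length - line.trimStart().length;
  const keyIndent = indentOf(lines[start]);
  const body = [];
  let i = start + 1;
  // The block continues while lines are indented deeper than the key line.
  while (i < lines.length && indentOf(lines[i]) > keyIndent) {
    body.push(lines[i]);
    i++;
  }
  // `next` is the index of the first line after the block.
  return { body, next: i };
}

const sample = ["multi: |", "  line 1", "  line 2", "simple: value"];
console.log(captureBlock(sample, 0));
```

In the grammar, the same boundary is what the indent and dedent tokens are meant to mark.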

What I’ve tried

I’ve gotten this to parse by using the tokenizer/context from the indent example, but there are parse errors and the dedent tokens aren’t actually working.

Here’s a stackblitz example to see live.

The problem

For this input:

key: value_one
multi: |
  line 1
  line 2
key_two: hi

It generates a parse tree like so:

Doc (key: value_one\nmulti: |\n  line 1\n  line 2\nkey_two: hi)
  Property (key: value_one\n)
    Key (key:)
    Value ( value_one)
  Property (multi: |\n  line 1\n  line 2\n)
    Key (multi:)
    MultiLineExp ( |\n  line 1\n  line 2)
      MultiLineKey ( |)
      ⚠ (\n)
      Value (line 1)
      ⚠ (\n  )
      Value (line 2)
      ⚠ ()
  Property (key_two: hi)
    Key (key_two:)
    Value ( h)
    ⚠ (i)

This has errant Value nodes, stray \n characters, and errors. I really want everything from MultiLineKey to the next Property to be a MultiLineExp (which I think is what the dedent is supposed to do?).

The issue seems to be Value and the _ with the following declaration:

element {
  Property { Key (Value | MultiLineExp) lineEnd } 
}

MultiLineExp { MultiLineKey indent Value* (dedent | eof) }

Value { (![|#"\n] _)+ }

It seems like _ is somehow confusing the external tokenizer and no dedent token is being emitted.

My Questions

Is there a more natural way of expressing this multiline behavior that isn’t as reliant on the external tokenizer?
If a tokenizer is the only way forward, what is the best way to debug issues with the logic? The interplay of grammar and tokenizer is difficult for me to understand.
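
On the debugging question, one approach that may help is printing every node together with the exact text it covers, so that tokens swallowing more than expected become visible. Below is a minimal sketch. `printTree` is a hypothetical helper, and it assumes the object passed in behaves like a `Tree` from @lezer/common: it only uses `iterate({enter, leave})` plus `node.name`, `node.from`, `node.to` and `node.type.isError`.

```javascript
// Hypothetical helper: render a parse tree as indented lines of the form
//   NodeName "covered text"
// Assumes `tree` exposes iterate({enter, leave}) like a @lezer/common Tree.
function printTree(tree, input) {
  const out = [];
  let depth = 0;
  tree.iterate({
    enter(node) {
      // Error nodes are shown with a warning sign instead of their name.
      const label = node.type.isError ? "⚠" : node.name;
      // JSON.stringify makes newlines and spaces in the covered text visible.
      const text = JSON.stringify(input.slice(node.from, node.to));
      out.push("  ".repeat(depth) + label + " " + text);
      depth++;
    },
    leave() {
      depth--;
    },
  });
  return out.join("\n");
}

// Usage, assuming `parser` is the parser generated from the grammar:
//   console.log(printTree(parser.parse(input), input));
```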

More context

Here is a more extended version of my grammar (see the stackblitz for the whole working thing):

// simplified, see demo for full example
// https://stackblitz.com/edit/js-az5b2u?file=index.js%3AL5
@top Doc { element* }

element {
  Property { Key (Value | MultiLineExp) lineEnd } 
}

MultiLineExp { MultiLineKey indent Value* (dedent | eof) }

lineEnd { newline | eof }

@context trackIndent from "./tokens.js"
@external tokens Indentation from "./tokens.js" {
  indent
  dedent
  blankLineStart
}

@tokens {
  // ...
  Key { (@asciiLetter | "_")* ":" }
  Value { (![|#"\n] _)+ }
  // ...
}

marijn · August 11, 2023, 4:20pm

You aren’t accounting for the newlines in your parse rules (i.e. it looks like it should be MultiLineExp { MultiLineKey newline indent (Value (newline | eof))* (dedent | eof) }). Also, Value weirdly matches an additional random character after every character in the matched set.
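
To illustrate the second point, here is a rough regular-expression analogue of the Value token, run in plain JavaScript. It is a sketch, not Lezer itself: it assumes `_` matches any single character (including a newline) and that the token takes the longest possible match. `pairedValue` and `matchAt` are names made up for this example.

```javascript
// Regex analogue of the Lezer token  Value { (![|#"\n] _)+ }
// Each repetition consumes TWO characters: one outside the set, then any character.
// The sticky flag (y) anchors the match at lastIndex, like a tokenizer position.
const pairedValue = /(?:[^|#"\n][\s\S])+/y;

// Hypothetical helper: return the text matched starting exactly at `pos`, or null.
function matchAt(pattern, input, pos) {
  pattern.lastIndex = pos;
  const m = pattern.exec(input);
  return m ? m[0] : null;
}

// Two characters before the line break: the next pair would start on "\n",
// which the negated set rejects, so the token stops where expected.
console.log(JSON.stringify(matchAt(pairedValue, " 1\nmulti: |\n", 0)));
// → " 1"

// Three characters before the line break: "\n" falls in the "any character"
// slot of a pair, so the token runs on into the next line.
console.log(JSON.stringify(matchAt(pairedValue, " 12\nmulti: |\n", 0)));
// → " 12\nmulti: |"

// An odd-length tail leaves one character unmatched.
console.log(JSON.stringify(matchAt(pairedValue, " hi", 0)));
// → " h"
```

Because the token always consumes characters in pairs, whether it stops at the line break depends on the parity of the text before it.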

And your hard-coded term IDs aren’t accurate. @lezer/generator 1.4.0 adds support for passing a function as contextTracker, so that you don’t have to hard-code things like that.

NickTomlin · August 11, 2023, 6:25pm

And your hard-coded term IDs aren’t accurate. @lezer/generator 1.4.0 adds support for passing a function as contextTracker, so that you don’t have to hard-code things like that.

Wonderful! I was wondering why contextTracker didn’t allow for this.

You aren’t accounting for the newlines in your parse rules

Ah, thank you! That makes sense; I’ve implemented your suggestion. That has resolved the dedent detection issue.

However…

The plot thickens: even number failures

I’ve noticed that an even number of characters in my Value token leads to strange behavior:

key: 12
multi: |
  EVEN VALUES BROKEN

Leads to

Doc (key: 12\nmulti: |\n  EVEN VALUES BROKEN)
  Property (key: 12\nmulti: |\n)
    Key (key:)
    Value ( 12\nmulti: |\n)
  ⚠ (  EVEN)
  Property (EVEN VALUES BROKEN)
    ⚠ ()
    Value (EVEN VALUES BROKEN)

Whereas if I put

key: 1
multi: |
  ODD VALUES WORKING

I get:

Doc (key: 1\nmulti: |\n  ODD VALUES WORKING)
  Property (key: 1\n)
    Key (key:)
    Value ( 1\n)
  Property (multi: |\n  ODD VALUES WORKING)
    Key (multi:)
    MultiLineExp ( |\n  ODD VALUES WORKING)
      MultiLineKey ( |)
      MultiLineValue (ODD VALUES WORKING)

(a proper MultiLine expression).

A selection of my updated grammar (with your suggestions applied) is here (full sample here):

@top Doc { element* }

element {
  Property { Key " "? ((Value (newline | eof)) | MultiLineExp) } 
}

MultiLineExp { 
  MultiLineKey newline indent (MultiLineValue (newline | eof))* (dedent | eof) 
}

@tokens {
  // ...
  MultiLineValue { (![|#"\n]_)+ }
  Value { (![|#"\n] _)+ }
  @precedence {MultiLineValue, Comment, Value}
  @precedence {MultiLineKey, Value}
}

This happens with stacked Values as well as with Value followed by MultiLineValue, so I don’t think it’s related to the indentation parser. There seems to be something wrong with using _ (maybe this is the root of the weird character capture issue you mentioned?).

Changing to

Value { (@asciiLetter | @digit | "_")+ }

seems to resolve things, but unfortunately my input set is much wider than this (basically any character besides \n).

Update: seems to be related to the interplay between Key and Value actually…

# Key " "? means odd number values break
# Key " " means even number values break
Property { Key " " ((Value (newline | eof)) | MultiLineExp) }

marijn · August 12, 2023, 9:37am

Look at your token definition:

Value { (![|#"\n] _)+ }

I’m not sure what you are trying to encode, but it looks really wrong.

NickTomlin · August 12, 2023, 7:59pm

looks really wrong

Yes

What I’m trying to express is the “plain” YAML type:

plain: plain value {}

# also valid but potentially easier to capture
string: "string value {}"
multiline_key: |
  multi line
  value
number: 1234

Where plain is essentially a sequence that starts with an @asciiLetter and continues with any character other than \n or # until the end of the line. That’s obviously not quite working, though. In regex, I’d just do something like \w+: (\w[^\n]+), but that doesn’t seem to be how this token expression works. It feels like I’m doing something fundamentally wrong, but I don’t understand what it is.

Is there a way to do this with a plain token expression? Or do I need to reach for an external tokenizer for something like a plainString token that can match this sort of text?

NickTomlin · August 12, 2023, 8:30pm

Hmm, reworking things a little:

Plain { (@asciiLetter|@digit) ![\n#]* }

This seems to help. Apparently the _ was the culprit.

@top Doc { element* }

element {
  Property { Key " "? ((scalar (newline | eof)) | MultiLineExp) } 
}

scalar {
  Plain | Number
}

// ...

@tokens {
  // ...
  Number { @digit+ }
  Plain { (@asciiLetter|@digit) ![\n#]* }
  @precedence {Number, Plain}
}
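
As a sanity check on the Plain token, here is a regular-expression analogue in plain JavaScript. It is only a sketch under the assumption that @asciiLetter covers A-Z and a-z and @digit covers 0-9. `plainPattern` and `matchPlain` are names made up for this example.

```javascript
// Regex analogue of  Plain { (@asciiLetter|@digit) ![\n#]* }
// One letter or digit, then any run of characters that are neither "\n" nor "#".
const plainPattern = /[A-Za-z0-9][^\n#]*/y;

// Hypothetical helper: return the text matched starting exactly at `pos`, or null.
function matchPlain(input, pos) {
  plainPattern.lastIndex = pos;
  const m = plainPattern.exec(input);
  return m ? m[0] : null;
}

// Stops at the line break regardless of how many characters precede it.
console.log(JSON.stringify(matchPlain("plain value {}\nnumber: 1234", 0)));
// → "plain value {}"
console.log(JSON.stringify(matchPlain("hi\nnext", 0)));
// → "hi"

// Stops before a comment; note that the trailing space is still included.
console.log(JSON.stringify(matchPlain("12 # comment", 0)));
// → "12 "

// A leading space does not match; the grammar handles it separately with " "?.
console.log(matchPlain(" hi", 0));
// → null
```

Unlike the paired pattern, each repetition here consumes a single character, so the match never steps over a line break.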

I’ll continue with this approach unless there’s a better, more lezer-y way to do it.