Splitting YAML markdown from the rest of a document, ambiguity

I want to split a document into two nodes (yaml and markdown), and then use parseMixed to handle each section. That part works, but I’m having trouble with the actual splitting.

Here’s a simplified version:

@top Frontmatter { Yaml? Markdown }

@tokens {
  anything { _ }
  yamlStart { '---\n' }
  yamlEnd { '---' | '...' }
}
Markdown { anything+ }
Yaml { yamlStart anything* yamlEnd }

The @top will actually end up being something more like { YamlWithNewline Markdown | Markdown | YamlWithoutNewline }, but I’m simplifying.

This kind of works, but this case:

---

hello

will always be parsed as Frontmatter(Yaml(⚠️)), where I’d want it treated as markdown until yamlEnd was found, rather than being treated as broken yaml.

I thought the way to solve this was to mark the Markdown and Yaml sections as ambiguous and give the Yaml branch a dynamic precedence, so that when it hit that error the Markdown branch would win out instead. That doesn’t change the results, though, so I must be misunderstanding something.

All help appreciated

I think that’s a case where LR parsers just won’t do well, since it requires arbitrary look-ahead to figure out how to interpret something. GLR might not help a lot either, since the ambiguity can take very long to resolve if the frontmatter is long, and Lezer will stop one of the branches after (currently) 500 tokens or productions of both running without error.

This might be a case where a completely custom parser or a kludge with an external tokenizer is called for. (Though with those, I’m also not sure how to make that incremental.)

I wasn’t able to resolve it with GLR, even with very short (<500 token) content. It seemed that the error in the Yaml part wasn’t enough to cause it to be parsed as Markdown instead, but maybe it had already discounted it for some reason?

Since I’m trying to break the document into a maximum of 4 total tokens before using parseMixed (essentially Frontmatter(Yaml(content),Markdown(content))), is it possible to solve this with dynamic precedence/ambiguity markers? Do the inner mixed mode parsers count toward the token limit?

Your anything { _ } rule will match one token per character, plus the wrappers created by the repeat operator. Making your tokens span a whole line might help there (but still won’t solve the issue if the frontmatter is a few hundred lines long).
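To get a rough feel for the numbers, here is a small counting sketch in TypeScript. It is only an estimate under stated assumptions: the helper names are made up for illustration, the text is assumed to be ASCII with LF line endings, and the extra wrapper nodes created by the repeat operator are ignored (so the real counts would be higher).

```typescript
// Hypothetical helpers, not part of Lezer: they only estimate how many
// tokens each tokenization strategy would produce for a given document.

// One token per character, as with `anything { _ }` (ASCII assumed).
function perCharTokens(doc: string): number {
  return doc.length;
}

// One token per line: each line including its newline, plus a possible
// final line that has no trailing newline.
function perLineTokens(doc: string): number {
  const lines = doc.match(/[^\n]*\n|[^\n]+$/g);
  return lines ? lines.length : 0;
}

// A made-up frontmatter block with 100 short entries.
const sample = "---\n" + "key: value\n".repeat(100) + "---\n";

console.log(perCharTokens(sample), perLineTokens(sample));
```

For this sample the per-character count is already above 500, while the per-line count stays well below it. A frontmatter block of several hundred lines would still cross the limit either way.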

I was thinking that could be replaced with _+ to keep tokens below, say, 10.

I haven’t been able to solve this at all with the regular grammar (which is maybe expected), mostly because I haven’t been able to make repeats non-greedy. This means the following was always parsed as Yaml() instead of Yaml(), Markdown() as intended:

---
yaml
---
non-yaml with a horizontal rule below
---

I’ve replaced the whole thing with a single external Yaml token that reads the document from inputStream.input.read() once (only when input.pos is 0) and tests it with a simple regex. It’s a hack, but it seems to be working.
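For anyone who wants to try something similar, here is a minimal sketch of the regex check on its own, written as a plain TypeScript function so it can be tested without Lezer. The name frontmatterEnd and the exact regex are assumptions for illustration, not the code actually used above, and it assumes LF line endings. Inside an external tokenizer you would presumably run it once at position 0 and accept the Yaml token up to the returned offset.

```typescript
// Hypothetical helper: returns the offset just past the closing delimiter
// of a leading YAML block, or -1 if the document has no complete block.
function frontmatterEnd(doc: string): number {
  const opener = "---\n";
  // The opening delimiter has to be the very first line of the document.
  if (!doc.startsWith(opener)) return -1;

  // A closing line is '---' or '...' alone on a line (trailing blanks are
  // allowed). The 'm' flag lets ^ and $ match at line boundaries, and
  // exec() returns the earliest match, which gives the non-greedy
  // behaviour: a later horizontal rule stays in the Markdown part.
  const close = /^(?:---|\.\.\.)[ \t]*$/m.exec(doc.slice(opener.length));
  if (!close) return -1; // no closing delimiter: treat everything as Markdown

  return opener.length + close.index + close[0].length;
}

// The horizontal-rule example from earlier in the thread.
const doc = "---\nyaml\n---\nnon-yaml with a horizontal rule below\n---\n";
const end = frontmatterEnd(doc);
console.log(JSON.stringify(doc.slice(0, end)));
```

With this check, a document that starts with --- but never closes the block returns -1, so it can fall through to Markdown instead of becoming a broken Yaml node.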

I think it could also be handled before the parser with something like autolanguage, and then use different languages for yaml-only, yaml-as-frontmatter, and no-yaml documents to resolve the ambiguity.
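A rough sketch of what that pre-parse step might look like, in TypeScript. Everything here is an assumption for illustration: the DocKind labels, the classifyDocument name, and the rule that "yaml-only" means nothing but whitespace follows the closing delimiter. How the result is then mapped to a language is left out.

```typescript
// Hypothetical labels for the three cases mentioned above.
type DocKind = "yaml-only" | "yaml-frontmatter" | "no-yaml";

// Decide which language to use before any parser runs (LF endings assumed).
function classifyDocument(doc: string): DocKind {
  const opener = "---\n";
  if (!doc.startsWith(opener)) return "no-yaml";

  // First line consisting only of '---' or '...' after the opener.
  const close = /^(?:---|\.\.\.)[ \t]*$/m.exec(doc.slice(opener.length));
  // An opener without a closing line is treated as plain Markdown.
  if (!close) return "no-yaml";

  // Whatever follows the closing delimiter decides between the YAML cases.
  const rest = doc.slice(opener.length + close.index + close[0].length);
  return rest.trim() === "" ? "yaml-only" : "yaml-frontmatter";
}

console.log(classifyDocument("---\ntitle: x\n---\n# Heading\n"));
```

Since the decision is made on the whole text up front, the grammar for each language no longer has to express the ambiguity at all.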