cm6: Integrate yaml and markdown

teucer · June 7, 2022, 4:04pm

I have yaml documents where values are in markdown.

How can I integrate both parsers (the YAML legacy mode and the Lezer Markdown parser)?

pranav · June 7, 2022, 5:09pm

You could write an extension for the Markdown parser and pass it to the @codemirror/lang-markdown extension’s configuration. I did something similar for YAML frontmatter; you can check it here for reference.

marijn · June 7, 2022, 6:26pm

I think an extension for the Markdown parser is the wrong direction for this — you’ll want a YAML parser and then use parseMixed to parse some content as Markdown, not embed YAML in Markdown structure. But currently the only YAML parser is a simple legacy mode, which cannot be used as a basis for parseMixed, unfortunately.

teucer · June 7, 2022, 10:20pm

I tried to write a markdown extension by emulating how numbered lists are handled in the markdown parser. This seems to be challenging. Also it does not look right, as the document is ultimately a yaml file.

I am a little bit familiar with lark and thought I could translate this grammar to lezer.

Assuming this is possible (I can stick to simpler yaml files), how do I use parseMixed? Are there any examples?

marijn · June 8, 2022, 5:59am

That grammar looks suspiciously simple. The tree-sitter grammar for YAML is a lot messier, which aligns with what I’d expect for such a format.

If I understand correctly that you’d want to parse the content of some literals as Markdown syntax, the use of parseMixed here would look similar to the one the html parser uses to nest parsers for tag content (though without the tag-recognizing logic).
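
For readers looking for the shape of such a setup, here is a minimal sketch, not the html parser's actual code. It assumes a Lezer-based YAML parser already exists and that Markdown-bearing values show up as nodes named LongString or BlockString; those node names and the helper nestMarkdown are hypothetical. The interfaces below are structural stand-ins for the parts of @lezer/common that the callback touches, so the snippet runs without any packages installed.

```typescript
// Structural stand-ins for the @lezer/common types the callback uses.
interface NodeLike {
  name: string
  from: number
  to: number
}
interface NestedParseLike<P> {
  parser: P
}

// Build the callback that parseMixed expects: return a nested-parse
// description for nodes whose content should be parsed as Markdown,
// and null for every other node.
function nestMarkdown<P>(markdownParser: P, nodeNames: ReadonlySet<string>) {
  return (node: NodeLike): NestedParseLike<P> | null =>
    nodeNames.has(node.name) ? { parser: markdownParser } : null
}

// Stand-in value so the sketch is self-contained; in a real setup this
// would be the parser exported by @lezer/markdown.
const placeholderMarkdownParser = { name: "markdown" }
const exampleNest = nestMarkdown(
  placeholderMarkdownParser,
  new Set(["LongString", "BlockString"]) // hypothetical node names
)

console.log(exampleNest({ name: "BlockString", from: 10, to: 42 }))
console.log(exampleNest({ name: "Key", from: 0, to: 3 }))

// Wiring (assumes a Lezer YAML parser named yamlParser exists):
//   import { parseMixed } from "@lezer/common"
//   import { parser as markdownParser } from "@lezer/markdown"
//   const mixedParser = yamlParser.configure({
//     wrap: parseMixed(nestMarkdown(markdownParser, new Set(["LongString"]))),
//   })
```

The callback only decides which nodes get a nested parse; parseMixed itself takes care of running the inner parser over those ranges and mounting the resulting trees.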

teucer · June 10, 2022, 2:11pm

Below is my attempt to create a simplified grammar (I don’t need to support full yaml)

Keys are snake case ascii letters and digits
No handling of tags, anchors etc.
No handling of nested properties
No handling of block Sequence or Mapping
Tried to mix json and python grammars

I am getting the error: Inconsistent skip sets after newline Key ": " string. I have an issue with newlines in LongString.

@marijn
Could you please have a look and give me some pointers?

@top YamlText { document }

@skip { space | newlineEmpty | Comment }

document { "---"? newline (property newline dedent)+ "..."? eof }

property { Key ": " value }

value { Scalar | LongString  | Mapping | Sequence }

Scalar { True | False | Null | Number | SimpleString }
SimpleString { string }

LongString { quote? Multiline quote? | BlockString }

@skip {} {
  Multiline { string (newline+ indent string)* newline (dedent | eof) }
  BlockString {
    Op space newline+
    indent Multiline
    (dedent | eof)
  }
}

Mapping { "{" commaSep<Scalar>? "}" }

Sequence { "[" commaSep<Scalar>? "]" }

@context trackIndent from "./tokens.js"

@external tokens indentation from "./tokens" { indent, dedent }

@tokens {
  True  { "true" }
  False { "false" }
  Null  { "null" | "~" }

  Op { "|" | ">" | "|-" | ">-" | "|+" | ">+" }

  Number { "-"? (int | int? frac?) exp?  }
  int  { "0" | $[1-9] std.digit* }
  frac { "." (std.digit+ | ".nan" | ".inf")  }
  exp  { $[eE] $[+\-]? std.digit+ }

  string { char* }
  char { $[\u{20}\u{21}\u{23}-\u{5b}\u{5d}-\u{10ffff}] | "\\" esc }
  esc  { $["\\\/bfnrt] | "u" hex hex hex hex }
  hex  { $[0-9a-fA-F] }

  Key { keyChar (std.digit | keyChar)* }
  keyChar { std.asciiLetter | "_" }

  quote { '"' }
  Comment { "#" ![\n\r]* }
  space { ($[ \t\f] | "\\" $[\n\r])+ }

  "{" "}" "[" "]" ":" "|" ">" "+" "-"
}

commaSep<expr> { expr ("," expr)* }

@external tokens newlines from "./tokens" { newline, newlineEmpty, eof }

@external propSource jsonHighlighting from "./highlight"

@detectDelim

marijn · June 13, 2022, 9:56am

It doesn’t know which set of skip rules to use after ": " string, which may be either the end of a SimpleString, or inside a Multiline rule (which has local skip rules). This may be a language where putting the whitespace explicitly into the grammar, rather than using skip rules, is appropriate, since as far as I understand it the meaning of whitespace differs quite a bit depending on context in YAML.

teucer · June 13, 2022, 10:29pm

@marijn thank you for the reply.

I managed to get the basic version working: yaml.grammar.

I have a more advanced version that I will post shortly.

I have a few questions on the grammar:

1. It would be helpful to have a repetition operator on the tokens, e.g. how do I succinctly create a token matching dates with the format “2022-06-13”? My current solution is to manually repeat the digit 4 times for the year. Is there another way?
2. How do I express that I am at the start of the line (^ in regex)? I have implemented it with an external tokenizer. This is useful when I want to indicate that the current block mapping has ended.
3. How do I tell the parser to completely ignore tokens in certain contexts? E.g. I don’t want to parse comments in plain multiline strings. Currently I am using the highlighter to fix this after the fact. Would adding a new construct "@ignore" (like @skip) work?
4. There is an undocumented functionality: ( ![\n\r] | "\\" _ ). My interpretation is that “it does not contain a newline, but if it does, it is escaped”. Is that correct?
5. Even though I have declared “~” as null, it is not parsed as such. Any pointers?
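
The external tokenizer mentioned in the start-of-line question is not shown in the thread. A rough sketch of what such a check could look like follows; it is not the poster's code. InputLike mirrors the peek and acceptToken methods of the InputStream in @lezer/lr, while LINE_START, acceptAtLineStart and fakeInput are hypothetical names introduced for illustration. It assumes a zero-length marker token is acceptable for the grammar in question.

```typescript
// Minimal view of the input stream methods this sketch relies on.
interface InputLike {
  peek(offset: number): number
  acceptToken(token: number, endOffset?: number): void
}

const NEWLINE = 10
const CARRIAGE_RETURN = 13
const OUT_OF_INPUT = -1

// Hypothetical term id; in a real grammar it would be imported from the
// generated terms file.
const LINE_START = 1

// Accept a zero-length marker token when the previous code unit is a
// line break or there is no previous code unit (start of input).
function acceptAtLineStart(input: InputLike, lineStartTerm: number): boolean {
  const previous = input.peek(-1)
  if (previous === OUT_OF_INPUT || previous === NEWLINE || previous === CARRIAGE_RETURN) {
    input.acceptToken(lineStartTerm)
    return true
  }
  return false
}

// Tiny fake stream over a string, used only to exercise the check.
function fakeInput(text: string, pos: number): InputLike & { accepted: number[] } {
  const accepted: number[] = []
  return {
    accepted,
    peek(offset: number): number {
      const index = pos + offset
      return index < 0 || index >= text.length ? OUT_OF_INPUT : text.charCodeAt(index)
    },
    acceptToken(token: number): void {
      accepted.push(token)
    },
  }
}

console.log(acceptAtLineStart(fakeInput("a: 1\nb: 2", 5), LINE_START)) // at "b"
console.log(acceptAtLineStart(fakeInput("a: 1\nb: 2", 2), LINE_START)) // after ":"

// Wiring (assumes @lezer/lr is installed):
//   import { ExternalTokenizer } from "@lezer/lr"
//   export const lineStart = new ExternalTokenizer(input => {
//     acceptAtLineStart(input, LINE_START)
//   })
```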

marijn · June 14, 2022, 6:57am

1. No, there’s no counted repetition operator. It just doesn’t come up often enough.

2. Set up your grammar so that the token only appears after a newline, I guess.

3. Same answer: make sure the grammar only recognizes that token in that context.

4. All of the syntax in that expression is documented. Do you mean the ![] part? That’s a negated character set.

5. I don’t know what declaring something as null means here.
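
Since there is no counted repetition operator, one possible workaround for the date token (an assumption, not something suggested in the thread) is to generate the repeated rule text with a small script and paste or splice it into the @tokens block before running the Lezer generator. The helper names repeatExpr and dateTokenRule are hypothetical.

```typescript
// Repeat a grammar expression `count` times, separated by spaces,
// emulating a counted repetition such as digit{4}.
function repeatExpr(expr: string, count: number): string {
  return Array.from({ length: count }, () => expr).join(" ")
}

// Build the text of a token rule for dates shaped like 2022-06-13.
function dateTokenRule(ruleName: string): string {
  const digits = (count: number) => repeatExpr("std.digit", count)
  return `${ruleName} { ${digits(4)} "-" ${digits(2)} "-" ${digits(2)} }`
}

console.log(dateTokenRule("Date"))
// Date { std.digit std.digit std.digit std.digit "-" std.digit std.digit "-" std.digit std.digit }
```

This only saves typing; the resulting token is the same as one written out by hand.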

teucer · June 14, 2022, 11:33am

On 4., could you please explain the meaning of (![\n\r] | "\\" _ )? I don’t get what _ refers to.

marijn · June 14, 2022, 12:12pm

_ means “any character”

teucer · June 14, 2022, 2:27pm

Ok understood thx!

I have created scalarProperty as follows:

scalarProperty { Key ": " scalar newline }
scalar { Boolean | Null | Date | Number | Plain | String }

The issue is that Plain (unquoted string) overlaps with pretty much everything. I have tried to address it by setting token @precedence, e.g.

@precedence{Date, Number, Plain}
@precedence {Boolean, Plain}
...

Now when I enter key: 2022-06-\n, 2022 is interpreted as a Number and -06- as unknown.

I don’t understand why it is split: shouldn’t 2022-06- be parsed as Plain?