Understanding Lezer: Capitalization breaks grammar

samdesota · February 2, 2021, 6:10pm

I don’t think this is a bug, but perhaps?

I can get my grammar to work when I use lower case for “element-attribute” like so:

@top Atom { Text | Element }

Element { tag-name element-attributes }

element-attributes { "[" element-attribute "]" }

element-attribute { tag-name }

Text { string }

@tokens {
  tag-name { $[a-z0-9#\-\.]+ '+'? }

  string { '"' char* '"' }
  char { $[\u{20}\u{21}\u{23}-\u{5b}\u{5d}-\u{10ffff}] | "\\" esc }
  esc  { $["\\\/bfnrt] | "u" hex hex hex hex }
  hex  { $[0-9a-fA-F] }

  ws { $[ \n\r\t] }
}

@detectDelim

Given an input like “div[disabled]”, this produces:

{
  "type": {
    "name": "Atom",
    "props": {},
    "id": 1,
    "flags": 1
  },
  "children": [
    {
      "type": {
        "name": "Element",
        "props": {},
        "id": 3,
        "flags": 0
      },
      "children": [],
      "positions": [],
      "length": 13
    }
  ],
  "positions": [
    0
  ],
  "length": 13
}

However, I’d like “ElementAttribute” to be part of the syntax tree, so I capitalize:

@top Atom { Text | Element }

Element { tag-name element-attributes }

element-attributes { "[" ElementAttribute "]" }

ElementAttribute { tag-name }

Text { string }

@tokens {
  tag-name { $[a-z0-9#\-\.]+ '+'? }

  string { '"' char* '"' }
  char { $[\u{20}\u{21}\u{23}-\u{5b}\u{5d}-\u{10ffff}] | "\\" esc }
  esc  { $["\\\/bfnrt] | "u" hex hex hex hex }
  hex  { $[0-9a-fA-F] }

  ws { $[ \n\r\t] }
}

@detectDelim

Now the output given “div[disabled]” is ambiguous:

{
  "type": {
    "name": "Atom",
    "props": {},
    "id": 1,
    "flags": 1
  },
  "children": [
    {
      "buffer": {
        "0": 3,
        "1": 0,
        "2": 13,
        "3": 8,
        "4": 4,
        "5": 4,
        "6": 12,
        "7": 8
      },
      "length": 13,
      "set": {
        "types": [
          {
            "name": "⚠",
            "props": {},
            "id": 0,
            "flags": 6
          },
          {
            "name": "Atom",
            "props": {},
            "id": 1,
            "flags": 1
          },
          {
            "name": "Text",
            "props": {},
            "id": 2,
            "flags": 0
          },
          {
            "name": "Element",
            "props": {},
            "id": 3,
            "flags": 0
          },
          {
            "name": "ElementAttribute",
            "props": {},
            "id": 4,
            "flags": 0
          }
        ]
      },
      "type": {
        "name": "",
        "props": {},
        "id": 0,
        "flags": 8
      }
    }
  ],
  "positions": [
    0
  ],
  "length": 13
}

I’m confused about how this subtle difference causes the grammar to break. Perhaps I’m missing something fundamental about LR; I would appreciate any pointers.

marijn · February 2, 2021, 7:54pm

This does look like a bug—lowercase rules may be inlined, which may cause this difference somehow, but that shouldn’t affect the matched language. I’ll investigate.

marijn · February 3, 2021, 11:29am

I can’t reproduce this—your second grammar parses div[disabled] just fine too, when I try it, using lezer-generator 0.13.2 and the script below…

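// Build a parser from the grammar file and print the tree for the test input
let gen = require("lezer-generator")

let p = gen.buildParser(require("fs").readFileSync("bug.grammar", "utf8"))

// Concatenating with "" calls the tree's toString()
console.log(p.parse("div[disabled]") + "")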

samdesota · February 3, 2021, 2:38pm

Strange, that exact code still produces a bad parse for me on Node 12.2.0 with lezer-generator 0.13.2. Let me see if I can reproduce.

samdesota · February 3, 2021, 3:00pm

Still seeing the same issue on Node v14. I created a GitHub repo to reproduce:

You should be able to clone it, run npm install, and then run node parse.js; the script asserts that you are getting the same result as on my computer. I ran this test on Node v12 and Node v14.

marijn · February 3, 2021, 3:53pm

I am getting the same results. And the results look fine. Now I’m wondering what you mean by “the output is ambiguous”.

samdesota · February 3, 2021, 4:34pm

samdesota:

 "types": [
          {
            "name": "⚠",
            "props": {},
            "id": 0,
            "flags": 6
          },
          {
            "name": "Atom",
            "props": {},
            "id": 1,
            "flags": 1
          },
          {
            "name": "Text",
            "props": {},
            "id": 2,
            "flags": 0
          },
          {
            "name": "Element",
            "props": {},
            "id": 3,
            "flags": 0
          },
          {
            "name": "ElementAttribute",
            "props": {},
            "id": 4,
            "flags": 0
          }
        ]

Okay, so I assumed that these types indicated some sort of parser ambiguity. I thought I might be misunderstanding the output, and now I see that it is pretty hard to understand without using .cursor().

Thanks for looking into this. I’m reading through the docs again.

marijn · February 3, 2021, 5:08pm

Calling toJSON on the tree isn’t going to give you a very good sense of what it is. Try toString or, indeed, iterate through it with a cursor.
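
For anyone who lands here with the same confusion, here is a rough sketch of how the "buffer" in the JSON output above can be read by hand. It assumes the flat buffer stores four numbers per node (type id, start, end, and the buffer index just past the node's last descendant); that layout is an assumption about the tree library's internals, not documented API, and the helper names nodeToString and listNodes are made up for this illustration. The "types" list is simply the table that the type ids index into, not a sign of ambiguity.

```javascript
// Buffer and type names copied from the toJSON output in the first post.
// ASSUMED layout: four numbers per node -> [typeId, from, to, endIndex],
// where endIndex is the buffer index just past the node's last descendant.
const buffer = [3, 0, 13, 8, 4, 4, 12, 8];
const typeNames = ["⚠", "Atom", "Text", "Element", "ElementAttribute"];

// Hypothetical helper: render the node starting at `index` as Name(Child,...)
function nodeToString(index) {
  const name = typeNames[buffer[index]];
  const endIndex = buffer[index + 3];
  const children = [];
  let child = index + 4; // the first child, if any, follows its parent directly
  while (child < endIndex) {
    children.push(nodeToString(child));
    child = buffer[child + 3]; // skip over this child's own descendants
  }
  return children.length ? `${name}(${children.join(",")})` : name;
}

// Hypothetical helper: list every node with its range and the text it covers
function listNodes(input) {
  const rows = [];
  for (let i = 0; i < buffer.length; i += 4) {
    const from = buffer[i + 1], to = buffer[i + 2];
    rows.push(`${typeNames[buffer[i]]} ${from}-${to} ${JSON.stringify(input.slice(from, to))}`);
  }
  return rows;
}

console.log(nodeToString(0));
console.log(listNodes("div[disabled]").join("\n"));
```

Read this way, the buffer describes an Element spanning the whole input with an ElementAttribute child covering "disabled", all nested inside the top-level Atom node. In real code, toString or a cursor, as suggested above, is the better route than decoding the buffer yourself.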