Wrong match at beginning and end of text?

kometenstaub · April 21, 2022, 12:13pm

I am working on a CriticMarkup grammar: lang-criticmarkup/criticmarkup.grammar at main · kometenstaub/lang-criticmarkup · GitHub

When using the grammar in a basic Obsidian plugin with the example from the system guide to log the nodes (by passing it a string of the current note), I get odd results.

Grammar

@precedence {
  highl @left
}

@top Criticmarkup { expression* }

expression {
  Addition |
  Deletion |
  Substitution |
  Comment |
  Highlight
}

Addition {
  startAdd InnerText endAdd
}
Deletion {
  startDel InnerText endDel
}
Substitution {
  startSubs InnerText divideSubs InnerText endSubs
}
Comment {
  startComm InnerText endComm
}
Highlight {
  startHighl InnerText endHighl !highl Comment?
}


@tokens {

  startAdd { "{++" }
  endAdd { "++}" }
  startDel { "{--" }
  endDel { "--}" }
  startSubs { "{~~" }
  divideSubs { "~>" }
  endSubs { "~~}" }
  startComm { "{>>" }
  endComm { "<<}" }
  startHighl { "{==" }
  endHighl { "==}" }

  InnerText { (char)+ }

  char { $[\u{20}\u{21}\u{23}-\u{5b}\u{5d}-\u{10ffff}] | "\\" esc }
  esc  { $["\\\/bfnrt] | "u" hex hex hex hex }
  hex  { $[0-9a-fA-F] }

}

@detectDelim

Example text

Text before

{-- Delete me --}

text between

{++addition  ++}

{>>my comment<<}

{== my  
highlight ==}

{==highlight==}{>>test<<}


more text

Output

Node Criticmarkup from 0 to 147
Node ⚠ from 0 to 0
Node Addition from 0 to 13
Node InnerText from 0 to 11
Node ⚠ from 11 to 18
Node Deletion from 13 to 46
Node InnerText from 16 to 30
Node ⚠ from 30 to 52
Node Addition from 46 to 64
Node InnerText from 49 to 62
Node ⚠ from 62 to 69
Node Comment from 64 to 82
Node InnerText from 67 to 80
Node ⚠ from 80 to 87
Node Highlight from 82 to 104
Node InnerText from 85 to 90
Node ⚠ from 90 to 101
Node ⚠ from 104 to 106
Node Highlight from 106 to 147
Node InnerText from 109 to 131
Node ⚠ from 131 to 147

Issues

1. The nodes are all in there, but I always get an “Addition” node at the beginning if there are characters before the first node. Its end is the beginning of the first real node. It also detects “InnerText”, although that should only be in the nodes.
2. The first node doesn’t match if it doesn’t start on its own line. This is not an issue for the nodes after it.
3. The last node gets matched, but its end is the end of the text, although there is an end marker. Its “InnerText” gets matched correctly.
4. For “InnerText” with line breaks, not everything in the encompassing nodes gets recognised as “InnerText”. Only the part until the line break is considered “InnerText”, after that I get error nodes, although it should match line breaks (otherwise it shouldn’t match the Addition/Deletion nodes anyway).

The only thing I can imagine for 3. is greedy matching. I didn’t find a way to make it non-greedy.

I’m completely at a loss as to 1., 2. and 4.
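The greedy-matching guess can be checked roughly outside Lezer. The sketch below is not the generated parser: it uses a plain regular expression as a stand-in for the `InnerText` token (`char+`), assuming the tokenizer takes the longest possible match. The names `innerTextPattern` and `innerTextLength` are made up for this illustration. Since `char` accepts `+`, `-`, `=`, `~`, `<`, `>` and `}` but not a line feed, the stand-in runs straight through an end marker and stops at a line break.

```typescript
// Regex stand-in for the grammar's InnerText token (char+), not Lezer itself.
// char accepts U+0020 and above except '"' and '\', or a backslash escape.
const innerTextPattern: RegExp =
  /^(?:[^\u0000-\u001f"\\]|\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4}))+/;

// Hypothetical helper: length of the longest InnerText match starting at `from`.
function innerTextLength(doc: string, from: number): number {
  const m = innerTextPattern.exec(doc.slice(from));
  return m ? m[0].length : 0;
}

const addition: string = "{++addition  ++}";
const highlight: string = "{== my  \nhighlight ==}";

// Position 3 is just past the three-character start marker.
console.log(innerTextLength(addition, 3));  // 13: "addition  ++}", end marker included
console.log(innerTextLength(highlight, 3)); // 5: " my  ", stops at the line break
```

Both numbers agree with the logged output: “InnerText from 49 to 62” is 13 characters long, and “InnerText from 85 to 90” is 5.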

Versions

I’m using the following versions for generating the parser. I cannot update to the latest versions because Obsidian hasn’t updated yet.

├── @codemirror/highlight@0.19.8
├── @codemirror/language@0.19.10
├── @lezer/generator@0.15.4
├── @lezer/lr@0.15.8

Additional information

I tried to get the tests to work. In an earlier version they worked, but they needed a @skip {space}, which doesn’t work because the text inside the nodes can also contain spaces. An additional skip rule which excluded the Addition/Deletion etc. nodes didn’t work either; it led to overlapping tokens.

marijn · April 21, 2022, 2:31pm

Your grammar doesn’t seem to have any rule matching plain text without brace markers around it, so you’ll get error correction kicking in when you give it such text. Don’t most of these issues go away if you give the parser text that matches the grammar?
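To make the point about plain text concrete, here is a minimal hand-written segmenter, not a Lezer grammar and not the parser Lezer would generate. It is a sketch under assumptions: the `Text` kind, the `Segment` shape and the `segment` function are invented for illustration, and only the marker strings come from the grammar above. Everything outside a marker pair becomes `Text`, so input such as "Text before" no longer has to be absorbed by error recovery.

```typescript
// Node names mirror the grammar; "Text" is the extra plain-text kind (an assumption).
type Kind = "Text" | "Addition" | "Deletion" | "Substitution" | "Comment" | "Highlight";

interface Segment { kind: Kind; from: number; to: number; }

// Start/end marker pairs taken from the @tokens block.
const markers: Array<[Kind, string, string]> = [
  ["Addition", "{++", "++}"],
  ["Deletion", "{--", "--}"],
  ["Substitution", "{~~", "~~}"],
  ["Comment", "{>>", "<<}"],
  ["Highlight", "{==", "==}"],
];

function segment(doc: string): Segment[] {
  const out: Segment[] = [];
  let pos = 0;
  let textStart = 0;
  while (pos < doc.length) {
    let matched = false;
    for (const [kind, open, close] of markers) {
      if (doc.slice(pos, pos + open.length) !== open) continue;
      // Search for the first end marker, so the match is non-greedy
      // and may span line breaks.
      const end = doc.indexOf(close, pos + open.length);
      if (end < 0) continue; // unclosed marker: leave it as plain text
      if (textStart < pos) out.push({ kind: "Text", from: textStart, to: pos });
      out.push({ kind, from: pos, to: end + close.length });
      pos = end + close.length;
      textStart = pos;
      matched = true;
      break;
    }
    if (!matched) pos++;
  }
  if (textStart < doc.length) out.push({ kind: "Text", from: textStart, to: doc.length });
  return out;
}

console.log(segment("Text before\n\n{-- Delete me --}\n\ntext between"));
```

Unlike the grammar's `Highlight` rule, this sketch does not attach a directly following `Comment` to the highlight; the two come out as separate segments.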

kometenstaub · April 21, 2022, 4:39pm

Thank you!

That removed the errors, but also brought new ones. Because everything inside and outside can be normal text, I cannot really exclude anything, and when I define text, it stops supporting multiple lines for the tokens I care about.

I’ve maybe found a workaround by defining skip and precedence rules.

It works quite well ~~, apart from the fact that the last node still matches until the end of the file.~~

kometenstaub · April 21, 2022, 4:56pm

@skip {
  char | space | newline
}

@skip {} {
  critic {
    (Addition | Deletion | Substitution | Comment | Highlight)
  }
}

https://github.com/kometenstaub/lang-criticmarkup/blob/3b32583d50245dc1e903f7d2c83a0695c664a3e7/src/criticmarkup.grammar
This is what I have now. The “InnerText” still doesn’t support new lines, but removing three characters from the beginning/end and splitting at ~> for substitutions is good enough, I think.
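The post-processing mentioned here could look roughly like the sketch below. The helper names `stripMarkers` and `splitSubstitution` are hypothetical, and `raw` is assumed to be the full source text of one matched node, markers included (for example a slice of the document between the node's `from` and `to`).

```typescript
// Hypothetical helpers for the workaround; `raw` is the node's source text.
function stripMarkers(raw: string): string {
  // Every start and end marker in the grammar is exactly three characters long.
  return raw.slice(3, raw.length - 3);
}

function splitSubstitution(raw: string): { oldText: string; newText: string } {
  const inner = stripMarkers(raw);
  const at = inner.indexOf("~>");
  // Assumption: without a "~>" divider, the whole content counts as the old text.
  if (at < 0) return { oldText: inner, newText: "" };
  return { oldText: inner.slice(0, at), newText: inner.slice(at + 2) };
}

console.log(JSON.stringify(stripMarkers("{== my  \nhighlight ==}")));
console.log(splitSubstitution("{~~old~>new~~}"));
```

Because the slicing works on the node's whole range rather than on `InnerText`, line breaks inside a node are preserved.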