I am working on a CriticMarkup grammar: lang-criticmarkup/criticmarkup.grammar at main · kometenstaub/lang-criticmarkup · GitHub
When I use the grammar in a basic Obsidian plugin and log the nodes with the example from the system guide (passing it the text of the current note as a string), I get odd results.
Grammar
@precedence {
  highl @left
}
@top Criticmarkup { expression* }
expression {
  Addition |
  Deletion |
  Substitution |
  Comment |
  Highlight
}
Addition {
  startAdd InnerText endAdd
}
Deletion {
  startDel InnerText endDel
}
Substitution {
  startSubs InnerText divideSubs InnerText endSubs
}
Comment {
  startComm InnerText endComm
}
Highlight {
  startHighl InnerText endHighl !highl Comment?
}
@tokens {
  startAdd { "{++" }
  endAdd { "++}" }
  startDel { "{--" }
  endDel { "--}" }
  startSubs { "{~~" }
  divideSubs { "~>" }
  endSubs { "~~}" }
  startComm { "{>>" }
  endComm { "<<}" }
  startHighl { "{==" }
  endHighl { "==}" }
  InnerText { (char)+ }
  char { $[\u{20}\u{21}\u{23}-\u{5b}\u{5d}-\u{10ffff}] | "\\" esc }
  esc { $["\\\/bfnrt] | "u" hex hex hex hex }
  hex { $[0-9a-fA-F] }
}
@detectDelim
Example text
Text before
{-- Delete me --}
text between
{++addition ++}
{>>my comment<<}
{== my
highlight ==}
{==highlight==}{>>test<<}
more text
Output
Node Criticmarkup from 0 to 147
Node ⚠ from 0 to 0
Node Addition from 0 to 13
Node InnerText from 0 to 11
Node ⚠ from 11 to 18
Node Deletion from 13 to 46
Node InnerText from 16 to 30
Node ⚠ from 30 to 52
Node Addition from 46 to 64
Node InnerText from 49 to 62
Node ⚠ from 62 to 69
Node Comment from 64 to 82
Node InnerText from 67 to 80
Node ⚠ from 80 to 87
Node Highlight from 82 to 104
Node InnerText from 85 to 90
Node ⚠ from 90 to 101
Node ⚠ from 104 to 106
Node Highlight from 106 to 147
Node InnerText from 109 to 131
Node ⚠ from 131 to 147
Issues
1. All the nodes are there, but I always get an “Addition” node at the beginning if there are characters before the first real node. It ends where the first real node begins. It also contains an “InnerText” node, although that should only appear inside the markup nodes.
2. The first node doesn’t match if it doesn’t start on its own line. This is not an issue for the nodes after it.
3. The last node gets matched, but its end is the end of the text, although there is an end marker. Its “InnerText” gets matched correctly.
4. For “InnerText” with line breaks, not everything inside the enclosing node is recognised as “InnerText”. Only the part up to the line break is considered “InnerText”; after that I get error nodes, although it should match line breaks (otherwise the Addition/Deletion nodes shouldn’t match at all).
The only explanation I can imagine for 3. is greedy matching, but I didn’t find a way to make it non-greedy.
I’m completely at a loss as to 1., 2. and 4.
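As a sanity check for 4., the first alternative of the `char` token can be mirrored as plain comparisons and run against the example text. This is only a sketch under assumptions: `isPlainChar` and `firstRejected` are hypothetical helper names, the `"\\" esc` alternative is left out, and it says nothing about how the Lezer tokenizer actually picks between tokens.

```typescript
// Mirrors the first alternative of the `char` token from the grammar:
//   $[\u{20}\u{21}\u{23}-\u{5b}\u{5d}-\u{10ffff}]
// The `"\\" esc` alternative is deliberately left out of this sketch.
// Working on UTF-16 code units is enough here: surrogates (0xD800 and up)
// fall into the last range, just like the code points they encode.
function isPlainChar(codeUnit: number): boolean {
  return (
    codeUnit === 0x20 ||
    codeUnit === 0x21 ||
    (codeUnit >= 0x23 && codeUnit <= 0x5b) ||
    codeUnit >= 0x5d
  );
}

// Index of the first code unit the class rejects, or -1 if there is none.
function firstRejected(text: string): number {
  for (let i = 0; i < text.length; i++) {
    if (!isPlainChar(text.charCodeAt(i))) return i;
  }
  return -1;
}

console.log(firstRejected("{== my\nhighlight ==}")); // 6, the line break
console.log(firstRejected("{==highlight==}{>>test<<}")); // -1, the delimiters are accepted too
```

If this mirror is right, the class starts at U+0020 and therefore excludes the line break (U+000A), which may be why “InnerText” stops there. It also accepts every character used in the start and end markers, which could be related to 1. and 3., but that part is a guess.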
Versions
I’m using the following versions for generating the parser. I cannot update to the latest versions because Obsidian hasn’t updated yet.
├── @codemirror/highlight@0.19.8
├── @codemirror/language@0.19.10
├── @lezer/generator@0.15.4
├── @lezer/lr@0.15.8
Additional information
I tried to get the tests to work. In an earlier version they worked, but they needed a @skip {space}, which doesn’t work because the text inside the nodes can also contain spaces. An additional skip rule that excluded the Addition/Deletion etc. nodes didn’t work either; it led to overlapping tokens.