Proper local token group usage and incorrect matching

This question concerns the same grammar as described in this post, namely the CriticMarkup syntax. The current implementation of the syntax can be found in the details below.

A quick description of how the syntax works:

  • There are five types of markup: Addition, Deletion, Substitution, Comment and Highlight
  • This markup is delimited by the following characters: {++ ... ++}, {-- ... --}, {~~ ... ~> ... ~~}, {>> ... <<}, {== ... ==}, respectively
  • Markup inside other markup (nested markup) should not be parsed, e.g. {++ {-- text --} ++} should be parsed as an Addition node containing {-- text --}
  • The markup is used together with regular Markdown syntax (however, for my ViewPlugin, I only care about parsing the CriticMarkup syntax; I do not need to know what its contents are either)
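
To illustrate the intended semantics, here is a rough JavaScript sketch of the behaviour I’m after (this is just an illustration, not the Lezer implementation, and it ignores the MSub divider):

```javascript
// Sketch of the intended semantics (not the actual Lezer parser):
// find the earliest opening delimiter, then treat everything up to the
// first matching closer as plain content, so inner markup stays unparsed.
const DELIMS = [
  ["Addition",     "{++", "++}"],
  ["Deletion",     "{--", "--}"],
  ["Substitution", "{~~", "~~}"],  // MSub ("~>") omitted in this sketch
  ["Comment",      "{>>", "<<}"],
  ["Highlight",    "{==", "==}"],
]

function parse(text) {
  const nodes = []
  let pos = 0
  for (;;) {
    // earliest opening delimiter at or after pos
    const hit = DELIMS
      .map(([type, open, close]) => ({type, open, close, at: text.indexOf(open, pos)}))
      .filter(d => d.at >= 0)
      .sort((a, b) => a.at - b.at)[0]
    if (!hit) break
    const end = text.indexOf(hit.close, hit.at + hit.open.length)
    if (end < 0) break // unclosed node; the real grammar should produce an error here
    nodes.push({type: hit.type, content: text.slice(hit.at + hit.open.length, end)})
    pos = end + hit.close.length
  }
  return nodes
}
```

With this sketch, parse("{++ {-- text --} ++}") yields a single Addition node whose content is " {-- text --} ", which is exactly the no-nesting behaviour described above.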
Grammar
@detectDelim
@top CriticMarkup { (content|expression)* }

expression {
  Addition |
  Deletion |
  Substitution |
  Comment |
  Highlight
}

@skip { } {
  Addition { lAdd content? rAdd }
  Deletion { lDel content? rDel }
  Substitution { lSub content? MSub content? rSub }
  Comment { lCom content? rCom }
  Highlight { lHig content? rHig }
}

@local tokens {
  lAdd { "{++" }
  rAdd { "++}" }
  lDel { "{--" }
  rDel { "--}" }
  lSub { "{~~" }
  MSub { "~>" }
  rSub { "~~}" }
  lCom { "{>>" }
  rCom { "<<}" }
  lHig { "{==" }
  rHig { "==}" }
  @else content
}

@precedence {
  Addition,
  Deletion,
  Substitution,
  Comment,
  Highlight
}
Working examples
{++This is an addition++}
{++It works properly
across multiple lines++}

**Regular markdown** can also appear between the text

{-- A deletion              --}

{~~ A substitution    ~> to this ~~}

{>>A comment node<<}

{==Finally, a highlight==}

Issues

1. Nested markup

Consider the example below:

{++ {--text--} ++}

For my implementation, I’d expect this to be parsed as an Addition node with contents {--text--}, and the Deletion rule would not be included in the output – in general, I do not want to allow nested nodes.

However, the parse output tree gives the following:

CriticMarkup(Addition(⚠),Deletion,⚠)

name       from  to   content
Addition      1   5   {++
⚠             5   5
Deletion      5  15   {--text--}
⚠            16  19   ++}

This makes sense, since I’m currently specifying that only content (i.e. non-tokens) may appear between the markup brackets. So to solve that, I figured it should be as simple as also allowing tokens to exist between the brackets; however, no matter what approach I took, I kept getting either S/R and R/R conflicts, or an error mentioning: Tokens from a local token group used together with other tokens

Is this a precedence/ambiguity issue, and thus a matter of correctly formulating the rule, or is it just not possible to implement this specific syntax the way I envisioned using local token groups?

2. Incorrect matching

Input:

{++ text --}

Output:
CriticMarkup(Addition(⚠))

name       from  to   content
Addition      1  13   {++ text --}
⚠            10  14   --}

Sadly, this one I understand even less. Why does the parser match the Addition node, despite the fact that it has not encountered the rAdd token, as described in the rule: Addition { lAdd content? rAdd }?


I apologise if my explanations were unclear; I have only recently started dabbling with parsers again, and I’m still trying to re-learn how grammars should be constructed, so it’s highly likely that I’m making one (or many) rookie mistakes here.

Many thanks in advance!

The way you use the @local tokens block makes these all part of a single local token context, which means that in any of the blocks, any closing (and opening!) token is parsed. That’s clearly not what you want—in a {-- block, only the --} token needs to be matched, not any of the others. So I guess what you need is either a separate @local tokens block (with its own content token type) for each block type, or an external tokenizer for each type (which you can define using a single piece of code parameterized by the closing token).


Many thanks for taking the time to respond and giving a detailed & clear explanation, I immensely appreciate it! I’ll be trying both approaches later this evening, and will update this post with my findings (hopefully being the working grammar).

Thanks again for the help!

I wish I could report a fix using one of the two aforementioned methods, but sadly, after trying both for a couple of hours, I’ve made little to no progress.

Method 1: Split local token groups

(…) So I guess what you need is either a separate @local tokens block (with its own content token type) for each block type

This does seem like the most promising (and, functionality-wise, simplest) implementation, but even after trying many different variations I’ve gotten no further – with the current grammar, I’m getting the Tokens from a local token group used together with other tokens (lAdd with lHig) error. Is that due to the independent local token groups being combined in the expression rule, or does the problem lie somewhere else?

Failing grammar ('Tokens from a local token group...')
@detectDelim
@top CriticMarkup { (expression)* }

expression {
  Addition |
  Deletion |
  Substitution |
  Comment |
  Highlight
}

@skip { } {
  Addition { lAdd additionContent? rAdd }
  Deletion { lDel deletionContent? rDel }
  Substitution { lSub substitutionContent? MSub substitutionContent? rSub }
  Comment { lCom commentContent? rCom }
  Highlight { lHig highlightContent? rHig }
}

@local tokens {
  lAdd { "{++" }
  rAdd { "++}" }
  @else additionContent
}

@local tokens {
  lDel { "{--" }
  rDel { "--}" }
  @else deletionContent
}

@local tokens {
  lSub { "{~~" }
  MSub { "~>" }
  rSub { "~~}" }
  @else substitutionContent
}

@local tokens {
  lCom { "{>>" }
  rCom { "<<}" }
  @else commentContent
}

@local tokens {
  lHig { "{==" }
  rHig { "==}" }
  @else highlightContent
}

I also tried separating the skip rules into separate skip blocks (in case that was the issue), but that made no difference either.

I’m also wondering: with this approach, how would you skip text that appears outside of the nodes? My guess is that just adding one of the ...contents to the top-level rule wouldn’t work, but I’m not sure how else to do it while still using local token groups.


Method 2: External tokenizer

(…) or an external tokenizer for each type (which you can define using a single piece of code parameterized by the closing token).

I’m going to be honest: I’m completely stuck with this one. I’ve tried looking at as many examples as I could find, but I’m afraid most are too difficult for me to comprehend properly, so I have little idea what I’m actually doing here.

Below is what I have currently, which I hope is at least part of the way to a working solution. Note that I haven’t implemented parsing for MSub (the ~> in {~~ original ~> replacement ~~}) yet:

tokens.js
import {ExternalTokenizer} from "@lezer/lr"
import {
	RAdd, RDel, RSub, RCom, RHig,
	LAdd, LDel, LSub, LCom, LHig,
} from "./parser.terms.js"


const leftBracket = {
	RAdd: LAdd,
	RDel: LDel,
	RSub: LSub,
	RCom: LCom,
	RHig: LHig,
}

const openingBracket = "{".charCodeAt(0)
const innerSymbol = {
	RAdd: "+".charCodeAt(0),
	RDel: "-".charCodeAt(0),
	RSub: "~".charCodeAt(0),
	RCom: ">".charCodeAt(0),
	RHig: "=".charCodeAt(0),
}
const EOF = -1

// Template tokenizer for every bracket type
function bracketTokenizer (close) {
	return new ExternalTokenizer((input, stack) => {
		let current = input.peek(0)
		let next = input.next

		while (true) {
			if (next === EOF) { break }
			else if (current === openingBracket && next === innerSymbol[close] && input.peek(2) === innerSymbol[close]) {
				input.acceptToken(leftBracket[close]);
			} else {
				current = input.advance();
				next = input.next;
			}
		}

	})
}


export const lAdd = bracketTokenizer(RAdd);
export const lDel = bracketTokenizer(RDel);
export const lSub = bracketTokenizer(RSub);
export const lCom = bracketTokenizer(RCom);
export const lHig = bracketTokenizer(RHig);

Grammar using external tokens
@detectDelim
@top CriticMarkup { (expression)* }

expression {
  Addition { LAdd RAdd } |
  Deletion { LDel RDel } |
  Substitution { LSub MSub RSub } |
  Comment { LCom RCom } |
  Highlight { LHig RHig }
}

@external tokens lAdd from './tokens' { LAdd }
@external tokens lDel from './tokens' { LDel }
@external tokens lSub from './tokens' { LSub }
@external tokens lCom from './tokens' { LCom }
@external tokens lHig from './tokens' { LHig }


@tokens {
  RAdd { "++}" }
  RDel { "--}" }
  MSub { "~>" }
  RSub { "~~}" }
  RCom { "<<}" }
  RHig { "==}" }
}

Input

{++ text ++}

Output

CriticMarkup(⚠,Addition(⚠,RAdd),⚠)

name       from  to   content
⚠             0  14   {++ text ++}
Addition     10  13   ++}
⚠            10  10
RAdd         10  13   ++}
⚠            13  14

And that’s all aside from the issue of skipping any characters that appear inside/outside the nodes – I haven’t the slightest idea how to solve that. Do I need to add a skip rule for any type of character that’s not a token? Local token groups don’t seem to work with the external tokens either.


Any help/pointers toward a potential solution would be greatly appreciated!

Do not put the opening tokens in the local token groups. When parsing content or a close token, you want only those two tokens, nothing else, so those should be in a group. When an opening token is being parsed, other tokens are valid, so that’s a different token context (and you’ll probably want to use a regular @tokens group for it).


Many, many thanks for the quick response! I applied your suggestion, which did get rid of the Tokens from a local group... error, and the grammar now processes all normal and nested nodes perfectly!

Working examples:

  • {-- text --}    →    CriticMarkup(Deletion)
  • {++ text {--nested--} ++}    →    CriticMarkup(Addition)
  • {++ text ++}{== text ==}{-- text --}{>> text <<}{~~ text ~> text ~~}    →    CriticMarkup(Addition,Highlight,Deletion,Comment,Substitution(MSub))
  • x{>> text <<}    →    CriticMarkup(Comment)

There remains only one last problem – or question, really. I’m currently using char to match the top-level text, and this generally works perfectly, except when there is a mixed node (see the examples below). Is this a precedence issue, or perhaps something else?

Failing examples:

  • {-- text ++}    →    CriticMarkup(Deletion(⚠))
  • {-- text ++}{-- text ++}    →    CriticMarkup(Deletion(⚠))
Updated grammar
@detectDelim
@top CriticMarkup { (char|expression)* }

expression {
  Addition |
  Deletion |
  Substitution |
  Comment |
  Highlight
}

@skip {} {
  Addition { lAdd additionContent? rAdd }
  Deletion { lDel deletionContent? rDel }
  Substitution { lSub substitutionContent? MSub substitutionContent? rSub }
  Comment { lCom commentContent? rCom }
  Highlight { lHig highlightContent? rHig }
}

@tokens {
  lAdd { "{++" }
  lDel { "{--" }
  lSub { "{~~" }
  lCom { "{>>" }
  lHig { "{==" }

  char { $[\n\r\t\u{20}\u{21}\u{23}-\u{5b}\u{5d}-\u{10ffff}] | "\\" esc }
  esc  { $["\\\/bfnrt] | "u" hex hex hex hex }
  hex  { $[0-9a-fA-F] }

}

@local tokens {
  rAdd { "++}" }
  @else additionContent
}

@local tokens {
  rDel { "--}" }
  @else deletionContent
}

@local tokens {
  MSub { "~>" }
  rSub { "~~}" }
  @else substitutionContent
}

@local tokens {
  rCom { "<<}" }
  @else commentContent
}

@local tokens {
  rHig { "==}" }
  @else highlightContent
}

The way I understood you before was that inside {--, tokens like ++} should just be parsed as content. Isn’t that the case?


Apologies, I realize I’ve given some unclear examples of what exactly is going wrong, and of what I meant by ‘mixed nodes’ (which is probably a terrible descriptor).

It is in fact still the case that I would like other tokens to be parsed as content when they appear within another node ({++ text {-- --}++} should be parsed as an Addition node with content text {-- --}) – this currently works perfectly.

However, with the latest grammar, there is an issue where, if an expression is not properly closed, it will consume all tokens occurring after the opening bracket as its content, even if some of those tokens form a syntactically/grammatically correct node themselves.

Example below:


Input 1 (Opening bracket matches all subsequent tokens)

Current output

{++ wrong --} {-- text --}    →    CriticMarkup(Addition(⚠))

name       from  to   text
Addition      0  26   {++ wrong --} {-- text --}
⚠            26  26

Expected output

{++ wrong --} {-- text --}    →    CriticMarkup(Deletion)

name       from  to   text
Deletion     14  26   {-- text --}


Perhaps this is a wrong assumption, but is this because, within the local token groups, the additionContent token has a higher precedence than any of the left-bracket tokens? I did try adding precedence declarations for each of the local token groups, but that did not seem to have any effect.

(Grammar from previous post has not been changed)

Well, yes, while parsing the content the parser will not know yet that this node isn’t properly closed, so that is indeed the behavior that follows from your specification of non-matching delimiters being a valid part of a block’s content.


Ah, I see, that makes a lot of sense. I’m assuming that the external tokenizer approach is the only way to implement this behaviour then?

As I understand, this external tokenizer will have to do the following things:

  • If a left bracket is matched, only emit this token if a corresponding right bracket token is found in the string
  • Any other opening bracket that is encountered should increase a nesting depth (tracked with a ContextTracker), so I know not to emit tokens inside well-formed brackets
  • Any other input should just be emitted as content to be skipped
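
As a plain-JavaScript sketch of the first bullet (ignoring the @lezer/lr InputStream API and the nesting-depth tracking for the moment):

```javascript
// Sketch of the lookahead decision from the first bullet: an opening
// delimiter only becomes a token when its matching closer occurs later
// in the input; otherwise it should be treated as plain content.
function shouldEmitOpen(text, pos, open, close) {
  if (!text.startsWith(open, pos)) return false
  // unlimited lookahead: scan the rest of the input for the closer
  return text.indexOf(close, pos + open.length) >= 0
}
```

With this check, the {++ in {++ wrong --} {-- text --} would not be emitted as an lAdd token (no ++} follows), so the later {-- text --} could still match as a Deletion. The scan over the rest of the input is the unlimited-lookahead cost of this approach.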

Since I’ve taken up a lot of your time already, this will be my last question – I’ll do my best to figure out any future issues myself. I’m already extremely grateful for all the answers you’ve given to my newbie questions; I’ll be sending you a coffee later today for all the help.

Unlimited lookahead in an external tokenizer is definitely going to impact how much incremental parsing Lezer will be able to do. I’m not really sure which way to point you at this stage – the thing you’re trying to do may not match what Lezer can do very well.