Running into ambiguity issues with simple grammar

bjhijmans · November 28, 2023, 3:04pm

Greetings,

I am trying to create a grammar that compiles a very simple language. I have [mark] and [/mark] tags that I want to parse. Here are some examples:

hello world should be parsed as a single Text node
[mark]hello world[/mark] should be parsed as MarkTag => (OpenTag Text CloseTag)
bla [mark]bla[/mark] bla should be parsed as Text MarkTag => (OpenTag Text CloseTag) Text
[mark] [mark] [/mark] [/mark] should match the first open tag with the first closing tag, i.e. opening tags inside a mark block are treated as text. No nesting.

Here are some attempts:

@top Program { expression* }

expression { MarkTag | Text }

MarkTag { (OpenTag Text CloseTag) }

@tokens {
  OpenTag { "[mark]" }
  CloseTag { "[/mark]" }
  Text { ![\n]+ }
  @precedence { OpenTag, CloseTag, Text }
}

Despite the @precedence, once the tokenizer matches something as Text, it matches the entire rest of the line that way.

@top Program { expression* }

expression { MarkTag | Text }

MarkTag { (OpenTag Text CloseTag) }

Text { chars ("[" chars)*}

@tokens {
  OpenTag { "[mark]" }
  CloseTag { "[/mark]" }
  chars { ![[\n]+ }
  @precedence {OpenTag, CloseTag, Text }
}

This attempted to interrupt the text on every [, to get it to consider OpenTag and CloseTag again, but it doesn’t work in all cases, such as [mark][[/mark], because it requires extra chars around the open bracket. And any attempt I made to fix that issue (such as replacing the + in chars with *) created shift/reduce conflicts.

Two of us have been working on this for ages, but we don’t seem to be getting anywhere. How do I fix this grammar?

marijn · November 28, 2023, 5:36pm

This looks like something you’ll want to use local token groups for—those are able to just put everything that isn’t one of the recognized tokens into an @else token, which you’d use for Text here.
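
For illustration, here is a minimal, untested sketch of how that might look for this grammar. It assumes two local token groups, one for text outside a mark block and one for text inside it. The names content, markContent and MarkText are made up for this example. Unlike the attempts above, the fallback tokens also swallow newlines, and the Text inside a MarkTag is optional so that [mark][/mark] parses.

```
@top Program { (Text | MarkTag)* }

MarkTag { OpenTag MarkText? CloseTag }

Text { content }

// Hypothetical helper rule; @name makes it appear as a Text node in the tree.
MarkText[@name=Text] { markContent }

// Outside a mark block only "[mark]" is special; everything else,
// including a stray "[/mark]", falls through to the @else token.
@local tokens {
  OpenTag { "[mark]" }
  @else content
}

// Inside a mark block only "[/mark]" is special, so a second "[mark]"
// is treated as plain text and the first closing tag ends the block.
@local tokens {
  CloseTag { "[/mark]" }
  @else markContent
}
```

Two groups are used because a local token group has to be the only source of tokens in the parse states where it applies, and the set of special tokens differs inside and outside a mark block.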