Retaining nested token structure

aukeroorda · October 22, 2024, 2:52pm

Hi CodeMirror and Lezer users,

I’m trying to create a programming language that uses simple annotations to mark elements of cooking recipes. I’ve got most of it working, but I’m struggling to get mixed numbers parsed correctly. I’ve tried two separate approaches, and each runs into a different issue.

These are the different types of exact_values I’m trying to parse:

1     // Natural number
22    // (can be multiple digits)
3/4     // Fractions
8 / 15  // (can be multiple digits, can have whitespace)
2 3 / 4 // Mixed numbers (natural followed by a fraction)
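
To make the target structure concrete, here is a minimal Python sketch, independent of Lezer, that classifies the forms listed above and returns a nested tuple for each. The function name `parse_exact_value` and the tuple layout are assumptions made for illustration only; they are not part of the grammar or of any Lezer API.

```python
import re

# Natural numbers have no leading zero, mirroring $[1-9]$[0-9]* in the grammar.
NATURAL = r"[1-9][0-9]*"
EXACT_VALUE = re.compile(
    rf"(?:(?P<whole>{NATURAL})[ \t]+)?"        # optional whole part of a mixed number
    rf"(?P<num>{NATURAL})"                     # natural number, or the numerator
    rf"(?:[ \t]*/[ \t]*(?P<den>{NATURAL}))?"   # optional "/ denominator"
)

def parse_exact_value(text):
    """Return a nested tuple describing the value, or None if it does not match."""
    m = EXACT_VALUE.fullmatch(text.strip())
    if m is None:
        return None
    if m["den"] is None:
        # A whole part without a following fraction ("2 3") is not a valid value.
        if m["whole"] is not None:
            return None
        return ("Natural_number", int(m["num"]))
    fraction = ("Fraction", int(m["num"]), int(m["den"]))
    if m["whole"] is None:
        return fraction
    return ("Mixed", ("Natural_number", int(m["whole"])), fraction)
```

For example, `parse_exact_value("2 3 / 4")` yields a `Mixed` tuple that contains a `Natural_number` and a `Fraction`, which is the kind of nesting I would like the parse tree to keep.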

Approach 1: No structure within tokens

In my current approach I parse these as tokens, which assigns the correct top-level type to each number but loses the subtree structure of fractions and mixed numbers:

@precedence { mixed, fraction, natural }
Exact_value {
    !mixed Mixed
  | !fraction Fraction
  | !natural Natural_number
}

@tokens {
// ...
  Mixed                        { Natural_number Hwhitespace? Fraction }
  Fraction                     { Natural_number "/"  Natural_number }
  Natural_number               { $[1-9]$[0-9]* }
  @precedence {Fraction, Mixed, Natural_number, Quantity_unit}
  
  Hwhitespace                  { $[ \t]+ }
}

However, with this in place, I lose the ‘subtree’ structure of mixed numbers and fractions, e.g.

[3]
[1/2]
[2 1/2]

parses to

My first question therefore is: is it possible to retain a subtree of tokens?

Approach 2: Precedence (?) trouble

The other approach is not to use ‘nested’ tokens to match the exact_values, but to define rules that match the structure I’m trying to capture:

@top recipe { (Exact_value "\n")+ }

@precedence { mixed, fraction, natural }
Exact_value {
    !mixed Mixed
  | !fraction Fraction
  | !natural Natural_number
}

Mixed { Natural_number Hwhitespace Fraction }
Fraction { Natural_number Hwhitespace? "/" Hwhitespace? Natural_number}

@tokens {  
  Natural_number               { $[1-9]$[0-9]* }
  Hwhitespace                  { $[ \t]+ }
  "/"
}

In an isolated environment this works and correctly matches all cases shown above. However, when incorporated into the rest of my language, the Natural_number option is matched and the fraction is labelled as an error, even though the Mixed rule is given precedence. I have to admit that I’m quite new to Lezer, so I might have made some rudimentary error elsewhere.

My second question is: how do I give precedence to matching the ‘longer’ option Mixed over matching the Natural_number rule early?

For the second question, this is the grammar and recipe I used to test this in the Lezer playground (https://lezer-playground.vercel.app/). Note: I’ve already removed the optional whitespace in the Fraction rule here.

Recipe:

# recipe

- [33] apples
- [1/2] apples
- [2 1/2] apples

Grammar:

@top recipe { block+ }

block { Paragraph | "\n" }

Paragraph {
  (Inline newline_or_eof)+ newline_or_eof
}

Inline {
  ( Quantity 
  | Non_delimiter_text
  )+
}

Quantity                     { "[" Exact_value? Hwhitespace? Quantity_unit? "]" }

@precedence { mixed, fraction, natural }
Exact_value {
    !mixed Mixed
  | !fraction Fraction
  | !natural Natural_number
}

Mixed { Natural_number Hwhitespace Fraction }
Fraction { Natural_number "/" Natural_number}

@tokens {
  Non_delimiter_text           { ![\n\[\]\{\}\@\|<>]+ }

  Quantity_unit                { ![0-9\n\[\]\{\}\@\|<>/ \t]![0-9\n\[\]\{\}\@\|<>]* }
  // Mixed                        { Natural_number Hwhitespace? Fraction }
  // Fraction                     { Natural_number "/"  Natural_number }
  Natural_number               { $[1-9]$[0-9]* }
  // @precedence {Fraction, Mixed, Natural_number, Quantity_unit}
  
  Hwhitespace                  { $[ \t]+ }
  // Delimiting tokens to render in tree
  "/"
  newline_or_eof { "\n" | @eof}
}

I hope I have provided enough context and information for my questions, but I’ll gladly provide any missing information!

Kind regards,
Auke

marijn · October 22, 2024, 3:17pm

You’ll want to specify the precedence markers at the point where the actual choice is made, not before the rules. Also this is much easier if you use @skip to model the skipping of whitespace, so that you don’t get LR(1) conflicts when the parser needs to look ahead past whitespace + a token. Something like this:

@top recipe { block* }

block { Paragraph | "\n" }

Paragraph { (Inline newline_or_eof)+ newline_or_eof }

Inline { (Quantity | Non_delimiter_text)+ }

@precedence { mixed, fraction, natural }

@skip { Hwhitespace } {
  Quantity { "[" Exact_value? Quantity_unit? "]" }
  Exact_value { Mixed | Fraction | Natural_number }
  Mixed { Natural_number !mixed Fraction }
  Fraction { Natural_number !fraction "/" Natural_number}
}

@tokens {
  Non_delimiter_text           { ![\n\[\]\{\}\@\|<>]+ }
  Quantity_unit                { ![0-9\n\[\]\{\}\@\|<>/ \t]![0-9\n\[\]\{\}\@\|<>]* }
  Natural_number               { $[1-9]$[0-9]* }
  Hwhitespace                  { $[ \t]+ }
  "/"
  newline_or_eof { "\n" | @eof}
}
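
To illustrate why skipping whitespace helps, here is a rough Python sketch (not Lezer, and not the algorithm the generated parser actually runs). It drops whitespace during tokenization, the role @skip plays above, and then decides between Mixed, Fraction, and Natural_number by looking at only the next token. The names `tokenize` and `parse_quantity` are hypothetical, and units are simplified to alphabetic words, which is narrower than the Quantity_unit token.

```python
import re

# Whitespace is matched but never emitted, so the parser never sees it.
TOKEN = re.compile(
    r"(?P<ws>[ \t]+)|(?P<nat>[1-9][0-9]*)|(?P<slash>/)|(?P<unit>[A-Za-z]+)"
)

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens + [("end", "")]

def parse_quantity(text):
    """Parse the text between "[" and "]" into (exact_value, unit)."""
    tokens, i = tokenize(text), 0

    def natural():
        nonlocal i
        kind, value = tokens[i]
        if kind != "nat":
            raise ValueError(f"expected a number, got {value!r}")
        i += 1
        return ("Natural_number", int(value))

    def fraction(numerator):
        nonlocal i
        if tokens[i][0] != "slash":
            raise ValueError("expected '/'")
        i += 1  # consume "/"
        return ("Fraction", numerator, natural())

    value = None
    if tokens[i][0] == "nat":
        first = natural()
        if tokens[i][0] == "nat":      # next token is a number: must be Mixed
            value = ("Mixed", first, fraction(natural()))
        elif tokens[i][0] == "slash":  # next token is "/": plain Fraction
            value = fraction(first)
        else:                          # anything else: plain Natural_number
            value = first
    unit = None
    if tokens[i][0] == "unit":
        unit = tokens[i][1]
        i += 1
    if tokens[i][0] != "end":
        raise ValueError(f"unexpected token {tokens[i][1]!r}")
    return value, unit
```

With whitespace out of the token stream, the token right after the first number is enough to pick the branch; if whitespace were an ordinary token, the same decision would need to see past it.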

aukeroorda · October 22, 2024, 3:32pm

Wow, thank you so much for your quick and informative reply! Going to try this out right now!

Kind regards,
Auke

aukeroorda · October 22, 2024, 4:27pm

Yes, this works flawlessly! Thanks a lot!

Auke