Parsing ini files with Lezer

SquidDev · September 18, 2022, 6:42pm

Hello! I’m currently writing a Lezer grammar/parser for systemd-style ini files for a personal project, and I’ve hit a couple of snags.

While the actual EBNF of ini files is simple, I’ve had a lot of trouble getting whitespace and comments to behave correctly.

The first issue is that comments are only valid if the line starts (ignoring leading whitespace) with a #. So for instance:

# This is a valid comment
x = 2 # but this is actually part of the value

This suddenly makes comments much harder to use in @skip rules, as they’re only valid in certain contexts.
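To make the rule concrete, here is a minimal sketch of the line-level check in plain TypeScript. The helper name `isCommentLine` is hypothetical and not part of Lezer; it also accepts `;` as a comment marker, on the assumption that it should match the `Comment` token in the grammar further down.

```typescript
// Hypothetical helper, not part of Lezer: reports whether a whole line is a
// comment under the rule described above.
function isCommentLine(line: string): boolean {
  // Optional leading spaces/tabs, then "#" or ";". Accepting ";" is an
  // assumption taken from the Comment token in the grammar in this post.
  return /^[ \t]*[#;]/.test(line);
}

// The two example lines from the post.
const samples: string[] = [
  "# This is a valid comment",
  "x = 2 # but this is actually part of the value",
];

// Only the first line is a comment; in the second, the "#" is part of the value.
const flags: boolean[] = samples.map(isCommentLine);
```

The point of the sketch is that the decision depends on where the `#` sits within the line, which is exactly the kind of context a plain @skip token cannot see.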

The other nasty feature is continuation lines. As in many other languages, a line can end with \ so that the line break is treated as a space rather than a new line. However, the following line(s) may be comments, which are skipped before the actual content appears:

[Section header\
# A comment
continued]

This is easier to express as a @skip rule. However, in the above case, the “continued” part is parsed as a syntax error, and I’m not quite sure why:

- Section: "[Section header\\\n# A comment\ncontinued]\n"
  - SectionHeader: "[Section header\\\n# A comment\ncontinued]"
    - SectionName: "Section header"
    - Comment: "# A comment" (skipped)
    - ⚠: "continued" (error, skipped)
    - SectionEnd: "]"
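For reference, the behaviour I am after can be sketched outside Lezer as a simple pre-pass over the text. This is only an illustration of the rules as described in this post (a trailing \ becomes a space, comment lines are dropped even inside a continuation); the function name `joinLogicalLines` is hypothetical, and it is not meant to mirror systemd's actual parser.

```typescript
// Hypothetical pre-pass: turns physical lines into logical lines following
// the rules described in this post.
function joinLogicalLines(source: string): string[] {
  const logical: string[] = [];
  // Text accumulated so far from lines that ended in a backslash.
  let pending: string | null = null;

  for (const raw of source.split("\n")) {
    // Comment lines are dropped, both at top level and inside a continuation.
    if (/^[ \t]*[#;]/.test(raw)) continue;

    const continues = raw.endsWith("\\");
    // A trailing backslash plus its line break is replaced by a single space.
    const body = continues ? raw.slice(0, -1) + " " : raw;
    const text = (pending === null ? "" : pending) + body;

    if (continues) {
      pending = text;
    } else {
      // Blank logical lines are not reported.
      if (text.trim() !== "") logical.push(text);
      pending = null;
    }
  }

  // Input that ends in the middle of a continuation still yields a line.
  if (pending !== null) logical.push(pending);
  return logical;
}

// The section header example from above.
const joined: string[] = joinLogicalLines("[Section header\\\n# A comment\ncontinued]\n");
```

With the example above, the three physical lines collapse into the single logical line `[Section header continued]`, which is what I would like the `SectionHeader` node to cover without an error node in the middle.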

For completeness, here is the whole grammar:

ini.grammar

@skip { space | Comment eol | "\n" } {
  @top Unit {
    Section*
  }

  Section {
    sectionHeader
  }
}

@skip { cont (space? Comment eol)* } {
  sectionHeader { SectionHeader eol }

  SectionHeader { "[" SectionName* SectionEnd }
  SectionEnd { "]" }
  SectionName { sectionName | "\\" | "]" }
}

@tokens {
  eol { @eof | "\n" }

  space { $[ \t]+ }
  Comment { $[#;] ![\n]* }
  cont { "\\" eol }

  sectionName { ![\n\\\]]+ }
}

I was wondering whether anyone with more familiarity with Lezer could offer some thoughts on what I’m doing, and whether there are any alternative routes I should be trying. It’s possible this is solvable with a context tracker plus a custom tokeniser, but I wasn’t quite sure how to go about that.

marijn · September 19, 2022, 8:04am

Newlines in ini files are definitely significant, so I’d recommend only skipping non-newline whitespace. Comments can only occur at the start of a line, so those should probably also be part of the actual grammar. Something like this seems to work okay (though the character set for Name probably needs more attention):

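// Only horizontal whitespace is skipped, so newlines stay significant.
@skip { space }
@top Unit { line* Section* }
Section { SectionHeader eol line* }
SectionHeader { "[" Name "]" }
Property { Name "=" Value }
// A line is a comment or property terminated by eol, or a blank line.
line { (Comment | Property) eol | "\n" }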

@tokens {
  eol { @eof | "\n" }
  space { $[ \t]+ }
  Comment { $[#;] ![\n]* }
  Name { (@asciiLetter | @digit)+ }
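  // Any run of non-newline characters; a backslash-newline pair continues the value.
  Value { (![\n]|"\\\n")+ }
  // Where both tokens could match, space is preferred over Value.
  @precedence { space Value }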
}
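To show what the `Value` token accepts, here is a rough regular-expression approximation in TypeScript. It is a sketch under assumptions, not Lezer's tokenizer: the names `valuePattern` and `matchValue` are made up for illustration, and the regex ignores the interaction with the skipped `space` token. Note that it only covers the backslash-newline continuation, not comment lines in the middle of a continued value.

```typescript
// Regex approximation (an assumption, not Lezer's tokenizer) of the token
//   Value { (![\n]|"\\\n")+ }
// One or more of: a backslash-newline pair, or any character except a newline.
const valuePattern = /^(?:\\\n|[^\n])+/;

// Hypothetical helper: returns the text the approximated Value token would
// cover at the start of `input`, or null when nothing matches.
function matchValue(input: string): string | null {
  const m = valuePattern.exec(input);
  return m === null ? null : m[0];
}

// A value continued onto a second physical line, followed by another property.
const continued: string | null = matchValue("2 \\\n  3\ny = 1");
```

Here the match runs across the escaped line break and stops at the first unescaped newline, so `y = 1` is left for the next `line`.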