Non-greedily matching part of a line of text?

wardellbagby · December 26, 2022, 4:08am

I’m attempting to write a grammar for my app Lyricistant, where the grammar tokenizes comments and non-comments.

That’s not a proper explanation, so here’s an example of the type of text I’m talking about here:

Hello world! // a comment
// This is also a comment
I'm Wardell!

What I’m attempting to parse from that is a tree that looks something like this:

Lyric("Hello world!")
LineComment("// a comment")
LineComment("// This is also a comment")
Lyric("I'm Wardell!")

What I’ve got so far is this:

@top Lyrics { expression* }

@skip { space }

expression { LineComment | Lyric }

@tokens {
  space { @whitespace+ }
  LineComment { "//" ![\n]* }
  Lyric { ![\n] }
  @precedence { LineComment, Lyric, space }
}

@detectDelim

The problem with my current grammar is that it matches each character in a Lyric individually (though it does match a whole LineComment correctly). Attempting to make it greedy via Lyric { ![\n]+ } causes comments at the end of a line to be recognized as a part of a Lyric instead of as a LineComment.

I’ve been reading the Lezer docs and, not for lack of trying, I fully admit I understand very little of them. I feel like I’m probably misusing tokens, but I’m not sure how to move forward.

marijn · December 26, 2022, 10:18am

It looks like Lyric should not consume double slashes. Something like Lyric { ![\n/] Lyric? | "/" (@eof | ![\n/] Lyric? }

wardellbagby · December 26, 2022, 12:21pm

You’re right in that it shouldn’t consume double slashes; I hadn’t thought of it that way and viewing the problem like that does make a solution more obvious. That being said, the solution of:

Lyric { ![\n/] Lyric? | "/" (@eof | ![\n/] Lyric?) } (yours had a tiny parsing error in missing the closing paren) works for all cases except a line that is just a “/” followed by empty lines.

Going back to my original text, if we add a new line:

Hello world! // a comment
// This is also a comment
I'm Wardell!
/

Now the tree looks like:

Lyric("Hello world!")
LineComment("// a comment")
LineComment("// This is also a comment")
Lyric("I'm Wardell!")
Error("/")

I honestly think this is fine (and a far cry better than what I had before), since Lezer is already pretty generous with errors and all I really care about is the correct representation of comments; I can assume any errors are just lines-to-be until proven otherwise.
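
To make the failing case concrete, here is a minimal sketch in plain TypeScript that models the recursive token rule Lyric { ![\n/] Lyric? | "/" (@eof | ![\n/] Lyric?) }. It is only an illustration of the rule's logic, not Lezer's actual tokenizer, and the function name matchLyric is hypothetical. It returns the end offset of the longest match, or -1 when the rule cannot match at all.

```typescript
// Hypothetical model of the token rule from this thread (not Lezer code):
//   Lyric { ![\n/] Lyric? | "/" (@eof | ![\n/] Lyric?) }
// Returns the end offset of the longest match starting at `pos`, or -1.
function matchLyric(text: string, pos: number): number {
  // Both alternatives need at least one character.
  if (pos >= text.length) return -1;
  const ch = text[pos];
  if (ch === "\n") return -1;

  if (ch !== "/") {
    // First alternative: ![\n/] Lyric?
    const rest = matchLyric(text, pos + 1);
    return rest === -1 ? pos + 1 : rest;
  }

  // Second alternative: "/" (@eof | ![\n/] Lyric?)
  const after = pos + 1;
  if (after >= text.length) return after; // "/" directly followed by end of input
  const next = text[after];
  if (next === "\n" || next === "/") return -1; // neither branch applies
  const rest = matchLyric(text, after + 1);
  return rest === -1 ? after + 1 : rest;
}

// The lyric stops right before the comment.
console.log(matchLyric("Hello world! // a comment", 0));
// A lone "/" followed by a newline cannot be matched by the rule.
console.log(matchLyric("/\n", 0));
```

In this model, a "/" is only accepted when it is followed by the end of input or by a character that is neither a newline nor another slash, which is why a line consisting of just "/" followed by an empty line ends up as an error.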

Thanks so much for being so unbelievably helpful!

marijn · December 26, 2022, 1:44pm

Ugh, indeed, that token isn’t correct, and it seems like it isn’t possible to express exactly what we need here with a strictly regular expression, so you will have to write a (simple) external tokenizer for your Lyric token.

wardellbagby · December 26, 2022, 8:09pm

You were right; it was simple and I did do it! To close the loop for any future searchers, here is the code for the grammar and here is the code for the external tokenizer.

The external tokenizer effectively just keeps consuming character by character until it hits a newline, the end of the file, or a double slash (//). Once it finds its end condition, it spits out a single token and exits.

It did take me an embarrassingly long time to realize that the token I pass into Input.acceptToken has to be retrieved from a compiled grammar. I tried to give it everything else before I found out.
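
For future searchers who can't follow the links, the scanning logic described above can be sketched roughly as follows. This is not the linked tokenizer itself, just a self-contained TypeScript approximation that works on a plain string; the name scanLyricEnd and the choice to return null when nothing is consumed are assumptions made for illustration.

```typescript
// Hypothetical sketch of the scanning loop (operates on a string, not on
// Lezer's input stream). Returns the end offset of the Lyric token that
// starts at `start`, or null if no characters could be consumed.
function scanLyricEnd(text: string, start: number): number | null {
  let pos = start;
  while (pos < text.length) {
    const ch = text[pos];
    // Stop at the end of the line.
    if (ch === "\n") break;
    // Stop before a double slash, which starts a LineComment.
    if (ch === "/" && text[pos + 1] === "/") break;
    pos++;
  }
  // Assumption: an empty match produces no token.
  return pos > start ? pos : null;
}

// A lone "/" is now consumed as an ordinary lyric character.
console.log(scanLyricEnd("/\nnext line", 0));
// Nothing is consumed when the position is already at a comment.
console.log(scanLyricEnd("// a comment", 0));
```

Inside a real ExternalTokenizer from @lezer/lr, the same loop would presumably read input.next (which is -1 at the end of the input), look ahead with input.peek(1), step forward with input.advance(), and finish with input.acceptToken(Lyric), where Lyric is the term id exported by the terms file that the grammar compiler generates. The grammar then declares the token with an @external tokens block instead of defining it inside @tokens.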

marijn · January 9, 2023, 1:23pm

For future reference, the new @local tokens feature might be useful in cases like this.