Handle ambiguity in production rules

Hello again!

I’m working on Hedy (you can see an example here), a language designed to teach kids programming. The language has a lot of ambiguity, and I’m not sure how to handle it using Lezer. I’m aware that I can mark the places where Lezer needs to use GLR, but it’s not clear to me how that would work with this particular language.

I’m having trouble with this particular construct: in Hedy there are no string delimiters until later in the lessons, so an assignment looks like this:

Name IsToken Text

like: age is 12

Or like this, when asking the user for input:

Name IsToken AskToken Text

like: age is ask What is your name?

My problem lies in how these tokens are defined:

@tokens {
    @precedence { 
        AskToken, 
        IsToken,     
        Name,
        Text,
        TextWithoutSpaces
    }
    
    Comment { "#" ![\n]* }
    AskToken { " "* "ask" " "* }
    IsToken {  " "* "is" " "* }
    eol { "\n" }
    TextWithoutSpaces { ![\n #]+ }
    Text { (![\n#ـ\r])(![\n#\r]*) }
    Name { letterOrUnderscore letterOrNumeral*}
    letterOrUnderscore { (@asciiLetter | $[_\u{a1}-\u{10ffff}])+ }
    letterOrNumeral { letterOrUnderscore | (@digit | letterOrUnderscore)* }    
}

So when I try to parse "var is several lines of text", this is the tree I get:

Program(
    Command(
        Assign(
            Var(Name),
            IsToken,
            ⚠(Name),
            Text
        )
    )
)

Notice how I first get a ⚠(Name) node before it parses the Text. How can I make it parse everything after is as Text?

This is my full grammar:

@top Program { eol* (Command eol+)* Command? }
Command {
    Assign  | Ask | ErrorInvalid
}

Assign { Var IsToken Text }
Ask { Var IsToken AskToken Text }
ErrorInvalid { TextWithoutSpaces Text? }

Var { Name }

@tokens {
    @precedence { 
        AskToken, 
        IsToken,     
        Name,
        Text,
        TextWithoutSpaces
    }
    
    Comment { "#" ![\n]* }
    AskToken { " "* "ask" " "* }
    IsToken {  " "* "is" " "* }
    eol { "\n" }
    TextWithoutSpaces { ![\n #]+ }
    Text { (![\n#ـ\r])(![\n#\r]*) }
    Name { letterOrUnderscore letterOrNumeral*}
    letterOrUnderscore { (@asciiLetter | $[_\u{a1}-\u{10ffff}])+ }
    letterOrNumeral { letterOrUnderscore | (@digit | letterOrUnderscore)* }    
}

@skip { Comment }

I’d use @extend for contextual keywords like ask. And maybe it helps to parse the kinds of things that can both have a grammatical role and be part of text as their own tokens (identifier, operator, etc.), and have a nonterminal that groups a bunch of them together into a Text node. But yeah, this language sounds hard to write a tokenizer for.

I don’t use @extend for keywords since this language is also translated into many languages, including Arabic, where the tokens look like this:

PrintToken {  ( "ـ"* "ق" "ـ"* "و" "ـ"* "ل" "ـ"*  | "print") }

Or like this:

PrintToken {  ( "imprimir"  | "print") }

What do you think would be the way forward here? Do I need to write an external tokenizer? To transpile the language we actually use an Earley parser, because of the many ambiguities this language has.

If you want the keywords to be parameterized at runtime, you can use either external tokenizers or an external specializer. If they are a fixed set, regular @extend should still work.
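As a rough, untested sketch of the external-specializer route: the declaration below asks Lezer to call a function on every Name token and let it decide whether that word is also a keyword. The function name extendKeyword and the module path ./tokens are hypothetical. The function would receive the token text and return a term id, or -1 when the word is not a keyword, so the keyword table could be picked per language at runtime.

```
// Hypothetical names: extendKeyword, "./tokens".
// This assumes IsToken and AskToken are removed from the @tokens block,
// since they are now produced from Name instead.
// "extend" (as opposed to "specialize") keeps the plain Name reading
// available as well, so a keyword can still appear as an ordinary word.
@external extend {Name} extendKeyword from "./tokens" {
  IsToken,
  AskToken
}

Assign { Var IsToken Text }
Ask { Var IsToken AskToken Text }
```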

Thank you! And what about distinguishing Var and Text in the Assign rule? That’s mainly what I want. As you can see, Name (which Var wraps) has the higher precedence, so the first word after is gets parsed as a Name. I don’t want that to be the case: in this particular position I want Text to have the higher priority.

I don’t think having a separate Text and Var token like that is going to work. That’s why I proposed making Text a nonterminal that accepts multiple tokens like Var and other things that may, in another context, have another role.
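A minimal, untested sketch of that idea, assuming a fixed English keyword set and spaces handled by @skip rather than inside the keyword tokens. The names isKw, askKw, textPart, Other and space are made up for the illustration, and ErrorInvalid is left out to keep it short. Text becomes a nonterminal over word-level tokens, and the keywords are contextual via @extend, so "is" inside the text is still read as a plain Name.

```
@top Program { eol* (Command eol+)* Command? }

Command { Assign | Ask }

Assign { Var isKw Text }
// "x is ask ..." can also be read as an Assign whose text starts with
// "ask"; the dynamic precedence is meant to prefer the Ask reading.
Ask[@dynamicPrecedence=1] { Var isKw askKw Text }

Var { Name }

// Text is no longer a token: it groups whatever word tokens follow.
Text { textPart+ }
textPart { Name | Other }

// Contextual keywords: only treated as keywords where the grammar
// expects them, otherwise they stay ordinary Name tokens.
isKw { @extend[@name=IsToken]<Name, "is"> }
askKw { @extend[@name=AskToken]<Name, "ask"> }

@skip { space | Comment }

@tokens {
  space { " "+ }
  eol { "\n" }
  Comment { "#" ![\n]* }
  Name {
    (@asciiLetter | $[_\u{a1}-\u{10ffff}])
    (@asciiLetter | @digit | $[_\u{a1}-\u{10ffff}])*
  }
  // Anything else that is not a space, newline or comment start.
  Other { ![\n #]+ }
  @precedence { Name, Other }
}
```

With this shape, "var is several lines of text" should come out as Assign(Var(Name), IsToken, Text(...)), with the individual words nested inside Text instead of an error node in front of it.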


Ohhhhh, I see!!! Thank you very much Marijn!