How to properly handle "." conflict for float literals and member expressions

MarkeyMark · November 8, 2024, 2:28am

I am trying to write my language so that it can support member expressions and floating point literals, but the “.” symbol causes conflicts. Would someone help point me in the right direction?

For floating point numbers I expect formats with or without a leading digit:

.3 is valid
0.3 is valid

For member expressions, the following inputs are parsed as float literals instead of member expressions:

1__id.2__id.3__id
@User.1.2

Ideally, I would want it to resolve into something like the following syntax trees:

For 1__id.2__id.3__id

Program(
    FieldIdentifier(
        IdentifierMember,
        IdentifierMember,
        IdentifierMember
    )
)

For @User.1.2

Program(
    SysVarIdentifier(
        IdentifierMember,
        IdentifierMember,
        IdentifierMember
    )
)

This is what my Lezer Grammar looks like:

@top Program { expression* }

@skip { space }

expression {
    literal | baseIdentifier
}

baseIdentifier[@isGroup=Identifier] {
    FieldIdentifier |
    SysVarIdentifier
}

literal[@isGroup=Literal] {
    FloatLiteral
}

SysVarIdentifier {
    "@" sysVarIdentifierMember ("." sysVarIdentifierMember)*
}

FieldIdentifier {
    fieldIdentifierMember ("." fieldIdentifierMember)*
}

@tokens {
    space { @whitespace+ }

    fieldIdentifierMember[@name=IdentifierMember] {
        $[a-z0-9]+ ("_" $[a-z0-9]+)* "__id"
    }
    sysVarIdentifierMember[@name=IdentifierMember] {
        ($[a-zA-Z0-9] | "_" | "/" | ":" | ";" | "-" | "+" | "*" | "[" | "]" )+
    }

    FloatLiteral { @digit* "." @digit+ }

    @precedence { fieldIdentifierMember, sysVarIdentifierMember, FloatLiteral, "." }
}

@detectDelim

MarkeyMark · November 8, 2024, 2:49am

I also just realized that SysVarIdentifier and FieldIdentifier should not allow spaces between the properties.

marijn · November 8, 2024, 6:52am

The way languages like JavaScript handle this is that a dot followed or preceded by a digit is always a floating point literal (which in Lezer would mean a higher token precedence on the floating point literal).

MarkeyMark · November 8, 2024, 7:42am

That would clear up the conflict, but I am confused about how that would create a useful syntax tree.

If I were to write something like some_name__id.1_other_name__id, I would want the tree to look something like Program(MemberExpression(Object, Property)), but if I were to put a higher precedence on the Float Literal, I would instead get something like Program(?, FloatLiteral, ?). How does the JavaScript grammar work around this?

marijn · November 8, 2024, 7:56am

That wouldn’t be valid in JavaScript—if your identifiers can start with a digit you’ll have to find some other strategy to disambiguate this. But that’s more of a syntax design issue than a Lezer issue.

MarkeyMark · November 8, 2024, 8:27am

I see the issue now. Unfortunately, I cannot make changes to the syntax.

It seems for now that the most consistent thing to do is create a token with a higher precedence than FloatLiteral, which takes the entire MemberExpression, since I cannot really break it into parts without conflicting with float literals.

Are external tokenizers able to help at all in this situation? I may just need to get the entire MemberExpression value using from and to on the document and split the string by periods.
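
A minimal sketch of the split-by-periods idea, assuming the whole member expression is matched as a single token and its text is read from the document between the node's from and to. The helper name splitMembers and the plain-string doc argument are made up for illustration; with a CodeMirror EditorState the text could presumably come from state.sliceDoc(from, to) instead.

```javascript
// Hypothetical helper: split the text of one member-expression node into
// its parts, keeping absolute document offsets for each part.
// `doc` is a plain string here; `from` and `to` are the node's boundaries.
function splitMembers(doc, from, to) {
  const text = doc.slice(from, to);
  const members = [];
  let start = 0;
  for (let i = 0; i <= text.length; i++) {
    // A period or the end of the text closes the current member
    if (i === text.length || text[i] === ".") {
      members.push({
        name: text.slice(start, i),
        from: from + start,
        to: from + i,
      });
      start = i + 1;
    }
  }
  return members;
}
```

For 1__id.2__id.3__id this gives three parts, each with its own offsets, which may be enough to highlight or look up the members separately even though the syntax tree only has one node for them.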

marijn · November 8, 2024, 10:11am

If lookahead helps, an external tokenizer could do that—using some more involved logic to determine whether to produce a dot token or a float literal. Or maybe make sure your grammar doesn’t allow float literals in any places that allow dots, so that contextual tokenizing takes care of this.
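
One way the lookahead idea could be sketched, assuming the rule "digits after a dot that run straight into a-z or an underscore belong to a field member". classifyDot and inputAt are hypothetical names; inputAt is only a string-backed stand-in for the input object an external tokenizer receives, and a real tokenizer would call input.acceptToken with the matching term instead of returning a string.

```javascript
const isDigit = (ch) => ch >= 48 && ch <= 57;
// Characters that can continue a field member after its digits: a-z or "_"
const continuesMember = (ch) => (ch >= 97 && ch <= 122) || ch === 95;

// Decide what the "." at the current position starts.
// `input` only needs a peek(offset) method, with peek(0) being the dot.
function classifyDot(input) {
  let off = 1;
  while (isDigit(input.peek(off))) off++; // scan past the digits after the dot
  if (off === 1) return "dot"; // no digit follows, so this cannot be a float
  // ".2__id" is a member access, ".3" followed by anything else is a float
  return continuesMember(input.peek(off)) ? "dot" : "float";
}

// String-backed stand-in for trying the function outside a parser.
// Returning -1 outside the text is an assumption of this stand-in.
function inputAt(text, pos) {
  return {
    peek(offset) {
      const i = pos + offset;
      return i >= 0 && i < text.length ? text.charCodeAt(i) : -1;
    },
  };
}
```

This covers the __id field case. An all-digit member such as the 1 in @User.1.2 still classifies as a float under this rule, so it would need further logic or the contextual-tokenizing route.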

MarkeyMark · November 9, 2024, 4:43pm

I noticed that you mentioned “look-ahead” but not “look-behind.” Does that mean external tokenizers don’t support look-behind?

marijn · November 9, 2024, 5:59pm

They do (via passing a negative number to input.peek).
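
For illustration, a look-behind check might look like the sketch below. letterBefore and inputAt are hypothetical names; inputAt is only a string-backed stand-in for the input object the tokenizer callback receives, and returning -1 outside the text is an assumption of that stand-in.

```javascript
// True when the code unit just before the current position is an ASCII
// letter. Only the input's peek(offset) method is used.
function letterBefore(input) {
  const prev = input.peek(-1); // negative offset looks behind
  return (prev >= 65 && prev <= 90) || (prev >= 97 && prev <= 122);
}

// String-backed stand-in exposing peek() relative to a fixed position
function inputAt(text, pos) {
  return {
    peek(offset) {
      const i = pos + offset;
      return i >= 0 && i < text.length ? text.charCodeAt(i) : -1;
    },
  };
}
```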

MarkeyMark · November 9, 2024, 8:05pm

I see. I am thinking of replacing my FloatLiteral token with an external tokenizer’s token. It would match exactly the same text, but it would also look behind to check whether a character matching $[a-zA-Z] sits right before the dot. Is that a correct use of external tokenizers?

@tokens {
    FloatLiteral { @digit* "." @digit+ }
}

marijn · November 9, 2024, 9:20pm

Sure, that sounds reasonable.
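
A rough sketch of what that tokenizer body could look like, under stated assumptions: readFloat is a made-up name, FloatLiteral = 1 stands in for the term id that the generated terms file would export, and StringInput is a string-backed imitation of the tokenizer input (next, peek, advance, acceptToken) so the function can be tried outside a parser. In a real setup the function would presumably be wrapped as new ExternalTokenizer(readFloat) from @lezer/lr and declared in the grammar with @external tokens.

```javascript
// Stand-in for the term id from the generated terms file (value made up)
const FloatLiteral = 1;

const DOT = 46;
const isDigit = (ch) => ch >= 48 && ch <= 57;
const isLetter = (ch) => (ch >= 65 && ch <= 90) || (ch >= 97 && ch <= 122);

// Same shape as @digit* "." @digit+, plus the look-behind check.
function readFloat(input) {
  // A float starting at the dot is rejected when a letter sits right
  // before it, so "name__id.1" leaves the dot to the member expression.
  if (input.next === DOT && isLetter(input.peek(-1))) return;
  while (isDigit(input.next)) input.advance(); // optional leading digits
  if (input.next !== DOT) return;
  input.advance();
  if (!isDigit(input.next)) return; // at least one digit after the dot
  while (isDigit(input.next)) input.advance();
  input.acceptToken(FloatLiteral);
}

// String-backed imitation of the tokenizer input (assumption: -1 is
// returned outside the text).
class StringInput {
  constructor(text, pos) {
    this.text = text;
    this.pos = pos;
    this.token = null;
  }
  peek(offset) {
    const i = this.pos + offset;
    return i >= 0 && i < this.text.length ? this.text.charCodeAt(i) : -1;
  }
  get next() {
    return this.peek(0);
  }
  advance() {
    this.pos++;
    return this.next;
  }
  acceptToken(term) {
    this.token = { term, to: this.pos };
  }
}

// End offset of a float starting at `pos`, or null when none is produced
function floatEndAt(text, pos) {
  const input = new StringInput(text, pos);
  readFloat(input);
  return input.token ? input.token.to : null;
}
```

With this, .3 and 0.3 are still read as floats, while the dot in some_name__id.1_other_name__id is left alone. In this standalone sketch the 1.2 in @User.1.2 still reads as a float, because the character before its dot is a digit, so the real grammar may need an additional check for that case.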