Context-sensitive keywords as identifiers (looking for alternatives to `@extend`)

savq · June 3, 2024, 6:21pm

I’m currently finishing my rewrite of the Julia Lezer grammar, and so far it’s going really well. It’s better than the tree-sitter grammar in some places. However, some of the trickier parts of the language are still causing problems.

Julia allows all keywords to be used as identifiers in certain places:

As symbols, e.g. :module
As field names, e.g. obj.module

It also allows some keywords to be used as identifiers if:

they appear inside indexing brackets (only begin and end), e.g. arr[begin + 1]
they’re part of a “compound keyword” like mutable struct ... or abstract type ..., e.g. mutable = true
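
The rules above can be written down as a small decision procedure. The following TypeScript sketch is only a toy model of those rules, not part of Lezer or of the grammar: classifyWord, Position, and both word lists are hypothetical names, the lists are deliberately incomplete, and the position of each word is assumed to be known already.

```typescript
// Toy model of the keyword rules listed above. Not Lezer's API; every name
// here is made up for illustration, and the word lists are incomplete.
type WordKind = "keyword" | "identifier";

// Where the word appears (assumed to be known from the surrounding syntax).
type Position = "symbol" | "field" | "index" | "elsewhere";

// A few reserved words, for illustration only.
const RESERVED = new Set(["module", "function", "if", "else", "begin", "end", "struct"]);

// First word of a compound keyword, mapped to the word that must follow it.
const COMPOUND = new Map<string, string>([
  ["mutable", "struct"],
  ["abstract", "type"],
]);

function classifyWord(
  word: string,
  position: Position,
  nextWord: string | null = null
): WordKind {
  // :module and obj.module: every keyword is allowed as an identifier.
  if (position === "symbol" || position === "field") return "identifier";
  // arr[begin + 1]: only begin and end lose their keyword meaning.
  if (position === "index" && (word === "begin" || word === "end")) return "identifier";
  // mutable = true vs. mutable struct ...: keyword only before its partner word.
  const partner = COMPOUND.get(word);
  if (partner !== undefined) return nextWord === partner ? "keyword" : "identifier";
  return RESERVED.has(word) ? "keyword" : "identifier";
}
```

For instance, classifyWord("module", "field") models obj.module, and classifyWord("mutable", "elsewhere", "struct") models the start of mutable struct .... The hard part in a real grammar is knowing the position, which is what the rest of this thread is about.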

To make this work, I tried using @extend, but @extend allows using keywords as identifiers anywhere, and this causes problems when both interpretations are valid.

Consider the following two programs:

@wasm module Foo
end

@wasm module Bar
  function f(b::u32)
    if b
      1
    else
      -1
    end
  end
end

Macro calls are parsed as a sequence of expressions (with no delimiters) like @mac expr1 expr2 .... So the first program is correctly parsed as:

 Program 0..21
   MacrocallExpression 0..20
     MacroIdentifier 0..5
       Identifier: wasm
     MacroArguments 6..20
       ModuleDefinition 6..20
         module: module
         Identifier: Foo
         end: end

However, for the second example, the parser gives up trying to parse the module definition,
and parses module Bar as two identifiers. Then it continues to parse the function definition,
and finally parses end as an identifier.

 Program 0..90
   MacrocallExpression 0..16
     MacroIdentifier 0..5
       Identifier: wasm
     MacroArguments 6..16
       Identifier: module
       Identifier: Bar
   FunctionDefinition 19..86
     function: function
     Signature 28..37
       CallExpression 28..37
         Identifier: f
         (: (
         Arguments 30..36
           BinaryExpression 30..36
             Identifier: b
             Identifier: u32
         ): )
     IfStatement 42..80
       if: if
       Condition 45..47
         Identifier: b
       IntegerLiteral: 1
       ElseClause 59..73
         else: else
         UnaryExpression 70..72
           IntegerLiteral: 1
       end: end
     end: end
   Identifier: end

Obviously, I don’t want module or other keywords to be parsed as identifiers outside the cases I outlined above. So now I’m looking at alternatives to @extend that are more precise so this doesn’t happen.

Is there a way to use precedence to guide the GLR extension to only consider the identifier cases in certain places?
Is there a way to resolve this using an external tokenizer? (We already use the tokenizer to parse identifiers)

Sorry for not including an MWE. I feel like the problem is very open ended, so for now I’m just looking for general pointers.

marijn · June 3, 2024, 6:50pm

These both seem unproblematic—a symbol would be a separate token, and thus be unambiguous. For field names, you should be able to define a separate token FieldName, which matches the same input as identifiers/keywords, but is only used after the dot character. The parser should be smart enough to use the appropriate token in the appropriate context, assuming they never appear in the same context.
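
One way to picture this suggestion is a tokenizer that chooses the token type from the context in which a word appears. The sketch below is a hand-written TypeScript toy, not Lezer's tokenizer or grammar syntax; the tokenize function, the Tok type, and the three-word keyword list are assumptions made for illustration. It uses the previous token as a stand-in for context, whereas in Lezer the parser state decides which tokens are tried.

```typescript
// Toy contextual tokenizer: a word right after "." becomes a FieldName and is
// never checked against the keyword list. All names here are hypothetical.
type TokKind = "Keyword" | "Identifier" | "FieldName" | "Punct";

interface Tok {
  kind: TokKind;
  text: string;
}

// Illustrative keyword list only.
const KEYWORDS = new Set(["module", "function", "end"]);

function tokenize(src: string): Tok[] {
  const out: Tok[] = [];
  // A word, or any single non-whitespace character.
  const re = /[A-Za-z_][A-Za-z0-9_]*|\S/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(src)) !== null) {
    const text = m[0];
    const prev = out.length > 0 ? out[out.length - 1] : null;
    if (!/^[A-Za-z_]/.test(text)) {
      out.push({ kind: "Punct", text });
    } else if (prev !== null && prev.kind === "Punct" && prev.text === ".") {
      // Field position: same input as an identifier, but a different token.
      out.push({ kind: "FieldName", text });
    } else {
      out.push({ kind: KEYWORDS.has(text) ? "Keyword" : "Identifier", text });
    }
  }
  return out;
}
```

With this toy, the word module in obj.module comes out as a FieldName, while the same word at the start of module Foo end comes out as a Keyword, because the two never compete in the same context.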

savq · June 4, 2024, 12:35am

I think I didn’t explain myself properly. My problem has to do with false positives.

The grammar does handle the cases I listed above. I have some tests in this file (search for “Keywords as identifiers”).

However, sometimes @extend will parse a keyword as an identifier when it should not parse an identifier. So I’m looking for something to restrict @extend, but not as strict as @specialize.

I cannot define symbols as tokens or in a skip rule because Identifier is an external token and : is an operator (a complete expression).

marijn · June 4, 2024, 5:34am

I do not have time to dig into the subtleties of the Julia grammar, but unless :symbol and the : operator can appear in the same place, the operator can again be used to make sure the token context is different, so that a different identifier token can be used in that situation to avoid the ambiguity.

There’s no reason an external tokenizer cannot be used multiple times to create multiple different tokens.

But if all else fails, @extend plus @dynamicPrecedence might also get you enough control over which interpretation is picked, though in a less efficient way.

savq · June 6, 2024, 8:25pm

I ended up with a mixed approach to parse the various types of keywords:

For keywords that cannot be used as identifiers: @specialize<Identifier, …>.
For keywords that can be used as identifiers: @extend<Identifier, …>.
For keywords in symbols and fields: I used the same approach as the old grammar. Define two identical external tokens Identifier and word; use word for symbols and fields; and use Identifier everywhere else. That way there’s no conflict with the specialized tokens.
For begin/end index variables: Using @extend caused a lot of problems because it was hard to ensure end was parsed as a keyword. Instead, I used @specialize, and defined a different set of rules to be used inside brackets, disallowing block statements, and allowing the specialized begin and end as valid expressions. This is very hacky, but doesn’t cause any ambiguity.
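
The last item amounts to tracking whether the parser is inside indexing brackets and switching rule sets accordingly. Here is a minimal TypeScript sketch of that idea, assuming pre-split tokens and a hypothetical labelBeginEnd helper. The actual grammar does this with separate Lezer rules, which are not shown here.

```typescript
// Toy illustration: two "rule sets" selected by bracket depth.
// labelBeginEnd and Label are hypothetical names, not part of the grammar.
type Label = "BlockKeyword" | "IndexValue" | "Other";

function labelBeginEnd(tokens: string[]): Label[] {
  let depth = 0; // number of currently open "[" brackets
  return tokens.map((tok): Label => {
    if (tok === "[") {
      depth += 1;
      return "Other";
    }
    if (tok === "]") {
      depth = Math.max(0, depth - 1);
      return "Other";
    }
    if (tok === "begin" || tok === "end") {
      // Inside brackets, block statements are disallowed in this model,
      // so begin/end can only be index values there.
      return depth > 0 ? "IndexValue" : "BlockKeyword";
    }
    return "Other";
  });
}
```

Because the choice depends only on the bracket depth, each begin or end has exactly one reading, which mirrors why the specialized approach avoids ambiguity.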

Overall, a very convoluted solution… but it works.