I’m currently finishing my rewrite of the Julia Lezer grammar, and so far it’s going really well. In some places it already does better than the tree-sitter grammar. However, some of the trickier parts of the language are still causing problems.
Julia allows all keywords to be used as identifiers in certain places:
- As symbols, e.g. `:module`
- As field names, e.g. `obj.module`
It also allows some keywords to be used as identifiers, if:
- Inside indexing brackets (only for `begin` and `end`), e.g. `arr[begin + 1]`
- they’re part of a “compound keyword” like `mutable struct ...` or `abstract type ...`, e.g. `mutable = true`
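
To make the rules above concrete, here is a small toy model of them in TypeScript. It is not Lezer code and not how the grammar is implemented. It only classifies words in an already-split token list, and every name in it (`classifyWords`, `RESERVED`, `COMPOUND`) is made up for illustration. The reserved-word list is deliberately incomplete.

```typescript
type Kind = "keyword" | "identifier";

// A few reserved words, enough for the examples in this post (not the full Julia list).
const RESERVED = new Set(["module", "function", "struct", "if", "else", "begin", "end"]);

// First word of a "compound keyword" mapped to the word that must follow it.
const COMPOUND = new Map<string, string>([
  ["mutable", "struct"],
  ["abstract", "type"],
]);

// Classify every word-like token; punctuation and numbers are skipped.
function classifyWords(tokens: string[]): Array<[string, Kind]> {
  const out: Array<[string, Kind]> = [];
  let bracketDepth = 0;
  tokens.forEach((tok, i) => {
    if (tok === "[") { bracketDepth++; return; }
    if (tok === "]") { bracketDepth--; return; }
    if (!/^[A-Za-z_]\w*$/.test(tok)) return;

    const prev = tokens[i - 1] ?? "";
    const next = tokens[i + 1] ?? "";
    const follower = COMPOUND.get(tok);
    let kind: Kind = "identifier";

    if (prev === ":" || prev === ".") {
      kind = "identifier"; // symbols and field names: any keyword is allowed
    } else if (follower !== undefined) {
      kind = next === follower ? "keyword" : "identifier"; // `mutable struct` vs `mutable = true`
    } else if (tok === "type") {
      kind = prev === "abstract" ? "keyword" : "identifier";
    } else if ((tok === "begin" || tok === "end") && bracketDepth > 0) {
      kind = "identifier"; // `arr[begin + 1]`
    } else if (RESERVED.has(tok)) {
      kind = "keyword";
    }
    out.push([tok, kind]);
  });
  return out;
}

console.log(classifyWords(["obj", ".", "module"]));
console.log(classifyWords(["mutable", "=", "true"]));
console.log(classifyWords(["mutable", "struct", "Foo"]));
```

The point of the sketch is that each case only needs one token of context on either side plus a bracket depth, which is why it feels like something more targeted than a blanket `@extend` should be possible.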
To make this work, I tried using `@extend`, but `@extend` allows using keywords as identifiers anywhere, and this causes problems when both interpretations are valid.
Consider the following two programs:
@wasm module Foo
end

@wasm module Bar
  function f(b::Int32)
    if b
      1
    else
      -1
    end
  end
end
Macro calls are parsed as a sequence of expressions (with no delimiters) like `@mac expr1 expr2 ...`. So the first program is correctly parsed as:
Program 0..21
  MacrocallExpression 0..20
    MacroIdentifier 0..5
      Identifier: wasm
    MacroArguments 6..20
      ModuleDefinition 6..20
        module: module
        Identifier: Foo
        end: end
However, for the second example, the parser gives up trying to parse the module definition and parses `module Bar` as two identifiers. Then it continues to parse the function definition, and finally parses `end` as an identifier.
Program 0..90
  MacrocallExpression 0..16
    MacroIdentifier 0..5
      Identifier: wasm
    MacroArguments 6..16
      Identifier: module
      Identifier: Bar
  FunctionDefinition 19..86
    function: function
    Signature 28..37
      CallExpression 28..37
        Identifier: f
        (: (
        Arguments 30..36
          BinaryExpression 30..36
            Identifier: b
            Identifier: u32
        ): )
    IfStatement 42..80
      if: if
      Condition 45..47
        Identifier: b
      IntegerLiteral: 1
      ElseClause 59..73
        else: else
        UnaryExpression 70..72
          IntegerLiteral: 1
      end: end
    end: end
  Identifier: end
Obviously, I don’t want `module` or other keywords to be parsed as identifiers outside the cases I outlined above. So now I’m looking at alternatives to `@extend` that are more precise so this doesn’t happen.
- Is there a way to use precedence to guide the GLR extension to only consider the identifier cases in certain places?
- Is there a way to resolve this using an external tokenizer? (We already use the tokenizer to parse identifiers)
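
To make the second question more concrete, this is roughly the kind of decision I imagine an external tokenizer making. It is a sketch under assumptions and has not been tried against the real grammar. `ParserStack` is a local stand-in for the stack object Lezer hands to external tokenizers (which, as far as I know, exposes a `canShift(term)` method), and the term ids and the `chooseTerm` helper are hypothetical.

```typescript
// Hypothetical term ids; in a real grammar these would come from the generated terms file.
const Identifier = 1;
const moduleKw = 2;
const endKw = 3;

// Stand-in for the parser stack: only the one method this sketch relies on.
interface ParserStack {
  canShift(term: number): boolean;
}

const KEYWORD_TERMS = new Map<string, number>([
  ["module", moduleKw],
  ["end", endKw],
]);

// Prefer the keyword whenever the current parse state accepts it,
// and fall back to Identifier only when it does not.
function chooseTerm(word: string, stack: ParserStack): number {
  const kw = KEYWORD_TERMS.get(word);
  if (kw !== undefined && stack.canShift(kw)) return kw;
  return Identifier;
}

// Fake parse states for illustration:
// at the start of an expression both readings are possible,
// after `.` or `:` only an identifier is.
const atExpressionStart: ParserStack = { canShift: () => true };
const afterDot: ParserStack = { canShift: (term) => term === Identifier };

console.log(chooseTerm("module", atExpressionStart) === moduleKw);
console.log(chooseTerm("module", afterDot) === Identifier);
```

Because the choice is made once in the tokenizer, there is no split between the keyword and identifier readings. The compound-keyword case (`mutable = true`) would presumably still need some lookahead on top of this.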
Sorry for not including an MWE. I feel like the problem is very open-ended, so for now I’m just looking for general pointers.