Ideal way to parse keywords that can also be identifiers?

I have keywords like ‘Action’, ‘Routine’ and ‘Private’ which are used in my rules, but they can occur as identifiers too. They should be parsed as identifiers when they are not specified in the rule definition.

For example, for content like Alpha Beta Gamma Routine and a rule like

RuleA {
    Identifier+
}

right now it’s getting parsed as RuleA, (Error) Routine, when it should be parsed purely as RuleA,

but while doing this, I also want it to parse content like Routine Alpha

with a rule like

RuleB {
    Routine Identifier
}

I fixed it using

@external extend {Identifier} extendIdentifier from "./tokens" {
  Routine
}

and then in tokens.js

import { Routine } from "./parser.terms.js" // term IDs emitted by lezer-generator; adjust the path to your build setup

const extraKeywordTokens = {
  routine: Routine
}

// Return the Routine term ID when the identifier's text matches a contextual
// keyword (case-insensitively), or -1 to leave it as a plain Identifier.
export const extendIdentifier = (value, stack) => {
  return extraKeywordTokens[value.toLowerCase()] || -1;
}

Is this the correct way to do it? Thank you.

You can also directly use @extend (instead of @specialize) in your grammar. See for example the way the JavaScript grammar handles contextual keywords.
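For reference, a minimal sketch of what that can look like for the Routine case above. This is an illustration rather than your actual grammar: a ";" is added to RuleB so the two readings can be told apart a token later, and the inline form matches the keyword text literally, so if you need case-insensitive keywords the external extender is still the better fit.

@top Document { RuleB | RuleA }

RuleA { Identifier+ }

// The extended "Routine" keeps its Identifier reading, so input like
// "Alpha Beta Gamma Routine" still parses as plain Identifiers in RuleA.
RuleB { routineKw Identifier ";" }

routineKw { @extend<Identifier, "Routine"> }

@skip { space }

@tokens {
  Identifier { $[a-zA-Z_]+ }
  space { $[ \t\n]+ }
}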


Oh that’s concise, thank you!

Sorry for the trouble, but what would you suggest for a case like this?

There are two rules: 1) URLText 2) Identifier

They have similar definitions, so ‘google’ can be parsed as both URLText and Identifier. I set the precedence as below, which seemed to fix it for most cases.

@precedence {URLText, Identifier}

but this breaks in one rule where both of these are used in close proximity: input like Alpha is getting parsed as URLText when it should be parsed as an Identifier.

From the docs, dynamic precedence seems to be a possible solution, but I’m not sure how to approach this with it, could you please give any suggestions?

This’ll happen in parse states where both are valid. It sounds like, if you have ambiguous tokens that may both occur in a given position, your grammar has a problem, and Lezer can’t do much more than apply the precedence you specified. I’m not sure whether you’re implementing an existing grammar or creating one here, but in the former case maybe check the spec for that grammar more closely, and in the latter see if you can change something to fix the ambiguity.
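For what it’s worth, here is a rough sketch of the kind of restructuring that can remove the ambiguity, under the made-up assumption that URLs only ever appear after a distinguishing marker (a hypothetical url keyword here), so that URLText and Identifier are never both valid in the same position:

@top Document { entry+ }

entry { UrlEntry | NameEntry }

// URLText can only occur after the (hypothetical) "url" keyword, so in every
// other position a bare word like "Alpha" tokenizes as an Identifier.
UrlEntry { kw<"url"> URLText }

NameEntry { Identifier }

kw<term> { @specialize<Identifier, term> }

@skip { space }

@tokens {
  // keep the token precedence for any state where both could still match
  @precedence { URLText, Identifier }
  URLText { $[a-zA-Z0-9:/._]+ }
  Identifier { $[a-zA-Z_]+ }
  space { $[ \t\n]+ }
}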

Oh cool, thank you!

Hi again Marijn, I’ve been thinking about specialize and extend.

Specifically, the below part:

There is another operator similar to @specialize, called @extend. Whereas specialized tokens replace the original token, extended tokens allow both meanings to take effect, implicitly enabling GLR when both apply. This can be useful for contextual keywords where it isn’t clear whether they should be treated as an identifier or a keyword until a few tokens later.

If it’s no bother, could you elaborate on the “allow both meanings to take effect” part?

Prior to using extend, I had most of my tokens specialized rather than extended. When I came across cases where identifiers could have the same text as those tokens, I extended them.

This is probably off base, but why not have all of the tokens extended? Why would specialize be needed? Parsing seems to work fine that way too, with the additional benefit of not clashing with identifiers.

Apologies if it’s very wrong, would really appreciate your input.

Firstly it’s less efficient to follow multiple parses on every keyword. But also this’d allow things like let function = 10 in, for example, the JavaScript parser, which the language most certainly does not allow. function is always a keyword, and most languages work like that—you can’t use keywords as identifiers.
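A tiny sketch of the difference, with made-up names rather than anything from a real grammar:

@top Program { statement+ }

statement { LetDecl | ExprStatement }

// Because "let" is specialized, the Identifier slots below can never match the
// text "let", so input like "let let = 1" is rejected, which is what you want
// for real keywords. If kw used @extend instead, the parser would fork on
// every "let", and "let let = 1" would get an (unwanted) successful parse.
LetDecl { kw<"let"> Identifier "=" Number }

ExprStatement { Identifier }

kw<term> { @specialize<Identifier, term> }

@skip { space }

@tokens {
  Identifier { $[a-zA-Z_]+ }
  Number { $[0-9]+ }
  space { $[ \t\n]+ }
}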

Got it. Thank you!

Just to confirm, is a use like this ideal?

specialize → Normal keywords
extend → Keywords that can conflict with identifiers: depending on the rule, they can be either an identifier or a language token
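Something roughly like this is the shape I have in mind (placeholder rules, just to illustrate the split):

@top Document { PrivateDecl | RoutineDecl | Identifier+ }

// hard keyword: once specialized, "Private" can never be read as an Identifier
PrivateDecl { @specialize<Identifier, "Private"> Identifier }

// contextual keyword: "Routine" keeps its Identifier reading wherever the
// RoutineDecl interpretation doesn't work out
RoutineDecl { @extend<Identifier, "Routine"> Identifier ";" }

@skip { space }

@tokens {
  Identifier { $[a-zA-Z_]+ }
  space { $[ \t\n]+ }
}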