How to reset context tracker in external tokenizer on input change?

Hello,

I have written a custom language using external tokenizers. The tokenizers accept only values that are loaded at runtime (from a server API). There are several such external tokenizers, e.g. field, fieldValue and keyword. The keyword always comes first, followed by the field and the fieldValue. In the external tokenizer I use a context tracker to store the keywords that have already been accepted as valid tokens, so that the tokenizer can then accept only fields that support those keywords. When the user inputs a fieldValue token, I remove the stored keywords from the context tracker.

This works well when the user types new input at the end, but I do not know how to correctly detect when the user edits the middle of the existing input or deletes part of it. I would need to update the context tracker in these situations, but I am not sure how to achieve this.

Do I understand correctly that you want to make parsing depend on content that comes after the current parse position? If so, that’s quite outside of what Lezer is designed for (in fact, relying too much on context that comes before is already usually a bad idea, since it will often prevent efficient incremental parsing from being possible). If this is needed for error checking, the recommended approach is to accept everything that is vaguely syntactically valid at the parser level, and, if you want to warn the user about mistakes, do that in a higher-level system, such as a linter that analyzes the syntax tree. In an editor context, a failed parse isn’t much use to the user anyway; it’ll just screw up highlighting, not provide useful feedback.
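
For illustration, a rough sketch of what that linter approach could look like with @codemirror/lint. The node names Keyword and Field and the keywordsByField lookup are assumptions, not the actual grammar:

import { linter, Diagnostic } from "@codemirror/lint";
import { syntaxTree } from "@codemirror/language";

// Hypothetical lookup built from the runtime field definitions,
// e.g. Map { "flows" => ["src"], "port" => ["src", "dst"] }.
declare const keywordsByField: Map<string, string[]>;

export const expressionLinter = linter(view => {
  const diagnostics: Diagnostic[] = [];
  let lastKeyword: string | null = null;
  syntaxTree(view.state).iterate({
    enter(node) {
      const text = view.state.sliceDoc(node.from, node.to);
      if (node.name == "Keyword") {
        lastKeyword = text;
      } else if (node.name == "Field" && lastKeyword) {
        const supported = keywordsByField.get(text) ?? [];
        if (!supported.includes(lastKeyword))
          diagnostics.push({
            from: node.from,
            to: node.to,
            severity: "error",
            message: `field "${text}" does not support keyword "${lastKeyword}"`
          });
      }
    }
  });
  return diagnostics;
});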

No, I do not want to make parsing dependent on the content after the current parse position, but on content before the current parse position. I will provide a more detailed example.

The language grammar is defined as a series of expressions joined by logical operators. A single expression is defined as keyword field fieldValue. There are several kinds of keywords, and each field supports a different subset of them. The fieldValue can generally be either a data type or one of a list of values, but it is not very relevant to the question, so I will not describe it in detail.
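
A simplified sketch of this structure in Lezer grammar notation (the names here are illustrative, not the exact grammar):

@top Query { Expression (LogicalOp Expression)* }

Expression { Keyword Field FieldValue }

@skip { space }

@tokens {
  LogicalOp { "and" | "or" }
  space { @whitespace+ }
}

@external tokens keyword from "./tokens" { Keyword }
@external tokens field from "./tokens" { Field }
@external tokens fieldValue from "./tokens" { FieldValue }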

Let’s look at a specific example. There are two fields, defined as:

[
  {
    field: "flows",
    keywords: ["src"],
  },
  {
    field: "port",
    keywords: ["src", "dst"]
  }
]

Given these definitions, the expression dst port 5 is valid because the port field supports the dst keyword. But dst flows 4 is not valid, since the flows field does not support the dst keyword.

In the external tokenizer, when parsing a keyword, I save it in the context tracker once it is accepted as a valid keyword (src or dst). When parsing a field, I want to accept only fields that support the given keyword, so I use the keyword stored in the context tracker to filter for the fields that support it. Only those fields are accepted as valid tokens. And finally, when a fieldValue is accepted, I clear the keywords stored in the context tracker.

Is it a bad idea to use the context tracker for this purpose? If so, what would be a suitable use case for the context tracker, and what would be a better approach to parsing this?

I mainly run into issues with this approach when the user, for example, deletes a keyword after it has been accepted and starts typing anew, because the keyword is still stored in the context tracker even though it should no longer apply. So my original question was how I can clear the context tracker in this situation.

That sounds like a reasonable way to use a context. I assume there is some production wrapping the keyword field fieldValue syntax. If you make your context’s reduce handler reset the context to whatever value you represent the empty context with when such a node is reduced, that should take care of clearing it after the expression. Since contexts are recomputed on re-parse, deleting a keyword shouldn’t cause any issues.
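
A minimal sketch of that, assuming the wrapping production is called Expression and the context is an immutable array of accepted keywords (the "./parser.terms" import path is also an assumption):

import { ContextTracker } from "@lezer/lr";
// Term ID from the generated parser; the module name is an assumption.
import { Expression } from "./parser.terms";

// The context: an immutable list of the keywords accepted so far.
export const trackKeywords = new ContextTracker<readonly string[]>({
  start: [],
  // When the node wrapping "keyword field fieldValue" is reduced,
  // replace the context with a fresh empty value.
  reduce(context, term) {
    return term == Expression ? [] : context;
  }
  // A real tracker should also define hash() so that different
  // contexts prevent incorrect reuse of cached nodes.
});

Keywords would be added to the context in a matching shift handler; a sketch of that follows further down the thread.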

Thank you, that makes sense, but I do not know how to write the shift/reduce logic. Is there some example code somewhere that does something like this? Specifically, I do not know how to detect that, e.g., the src token was removed because the user deleted the c at the end.

You don’t detect changes. A context is run alongside a parse, and the parser re-parses when a change happens, so a context doesn’t concern itself with actual edits. If you’re using a context, surely you are defining shift or reduce handlers in your context tracker?

I currently do not use the shift and reduce handlers at all; I am handling the context updates in the external tokenizers. E.g. for src:

export const src = new ExternalTokenizer(
    (input, stack: {context: ContextType}) => {
        // processInput is my own helper that feeds the upcoming input
        // text to the callback.
        processInput(input, read => {
            if (read === "src") {
                input.acceptToken(Src, 1);

                // Record the keyword by mutating the context in place.
                if (!stack.context.lastSrcDst.includes(read)) {
                    stack.context.lastSrcDst.push(read);
                }
            }
        });
    }
);

Based on your comments, I assume this is wrong?

Very. Contexts should be immutable values. Mutating them isn’t safe.

So how should I properly update the context?

Read the docs I linked and the indentation-sensitive grammar example. You return a new context from the handlers defined in the context tracker.
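
To make that concrete, a sketch extending the tracker above, still under the same assumptions (Src and Expression are term IDs from the generated parser): the shift handler returns a new array when a keyword token is shifted, and the tokenizers leave all context updates to the tracker.

import { ContextTracker, ExternalTokenizer } from "@lezer/lr";
// Assumed term IDs; the module name is an assumption.
import { Src, Expression } from "./parser.terms";

export const trackKeywords = new ContextTracker<readonly string[]>({
  start: [],
  // Return a *new* array when a keyword token is shifted; never mutate.
  shift(context, term) {
    return term == Src && !context.includes("src")
      ? [...context, "src"]
      : context;
  },
  // Reset to the empty context once a whole expression is reduced.
  reduce(context, term) {
    return term == Expression ? [] : context;
  }
});

export const src = new ExternalTokenizer(input => {
  // Hard-coded match for "src" (char codes 115, 114, 99), just for the
  // sketch. A real tokenizer would also check the following character.
  for (const code of [115, 114, 99]) {
    if (input.next != code) return;
    input.advance();
  }
  input.acceptToken(Src);
});

The field tokenizer can then read stack.context to decide which field tokens to accept, but it never modifies it. Since the tracker recomputes the context on every (re-)parse, deleting the c in src simply produces a parse in which that keyword was never shifted, so the context stays empty without any explicit clearing.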
