Old language mode used in CodeMirror v6 triggers full re-parse of the entire document

I have two questions:

Question 1:

I am using the language mode from CodeMirror v5 in CodeMirror v6 by calling StreamLanguage.define(new OldLangMode()).

Here, the OldLangMode class provides functions like token(stream, state), startState(), and copyState() to CodeMirror v6. However, whenever a line is changed, the token() function is called from the very beginning of the document instead of starting from the modified line. This triggers a re-parse of the entire document from the start.

In CodeMirror v5, the parser knew where to resume parsing. When I set a breakpoint in the token() function, I can see that the line number corresponds to the line where the change occurred.

In CodeMirror v6, is there a way to make the token() function start parsing from the line where the document was changed, rather than re-parsing the entire document from the beginning?

Question 2:

When the document is updated and the token(stream, state) function is called, is it possible to access the most recently updated content? The stream parameter provides access to the current line, but the complex logic in the OldLangMode class (used in CodeMirror v5) requires knowledge of the content of subsequent lines.

In CodeMirror v5, this was achieved using the codemirror.doc object, which allowed calling doc.getLine(n) to retrieve the content of specific lines.

Is there a way to pass this information to the OldLangMode class in CodeMirror v6?

Here is a snippet of what I have tried:

const oldLangMode = new OldLangMode(tokenTable, name);
// Defined a global variable in OldLangMode and updating it here
oldLangMode.setCurrentScript(value);
  
const extensions = [
  languageCompartment.of(StreamLanguage.define(oldLangMode)),
  EditorView.updateListener.of((update) => {
    if (update.docChanged && onChange) {
      const newValue = update.state.doc.toString();
      onChange(newValue);

      // Update the most recent script when doc changes.
      oldLangMode.setCurrentScript(newValue);

      update.view.dispatch({
        effects: languageCompartment.reconfigure(StreamLanguage.define(oldLangMode)),
      });
    }
  }),
  // other extensions...
];

const state = EditorState.create({
  doc: value,
  extensions,
});

As the code probably shows, this is not the right approach. I defined a variable in OldLangMode and update it whenever the document changes, so that the parsing logic can use the latest document. The problem is that the update listener fires after token(stream, state) has already been called. To work around that, I call update.view.dispatch({..}) inside the update listener to trigger token() again, this time with access to the latest document. As you can see, I am in a mess.

Any help would be highly appreciated. Thanks.

I was looking at the Parser::startParse() function. If the ranges parameter is passed, only those ranges are parsed. In my case ranges is not passed, which causes a full parse of the entire document. Is there a reason for that? Is it possible to pass ranges when the document changes, so that only those ranges are parsed?

class Parser {
    /**
    Start a parse, returning a [partial parse](#common.PartialParse)
    object. [`fragments`](#common.TreeFragment) can be passed in to
    make the parse incremental.
    
    By default, the entire input is parsed. You can pass `ranges`,
    which should be a sorted array of non-empty, non-overlapping
    ranges, to parse only those ranges. The tree returned in that
    case will start at `ranges[0].from`.
    */
    startParse(input, fragments, ranges) {
        if (typeof input == "string")
            input = new StringInput(input);
        ranges = !ranges ? [new Range(0, input.length)] : ranges.length ? ranges.map(r => new Range(r.from, r.to)) : [new Range(0, 0)];
        return this.createParse(input, fragments || [], ranges);
    }

It will not re-parse the entire document—only the chunk (of up to 2048 characters) around the edit, and anything below that.
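To make the "chunk around the edit, and anything below that" behaviour concrete, here is a toy model of checkpointed re-parsing. It is only a sketch and not CodeMirror's implementation: every name in it (CHUNK_LINES, tokenLine, parse) is hypothetical, and it uses a line-based chunk size where the library uses a character-based one.

```javascript
// Toy model only. All names are hypothetical; this is not library code.
const CHUNK_LINES = 2; // stand-in for the real, character-based chunk size

// A pure line tokenizer: the result depends only on the line and the incoming state.
function tokenLine(line, state) {
  const next = { depth: state.depth };
  for (const ch of line) {
    if (ch === "{") next.depth++;
    else if (ch === "}") next.depth--;
  }
  return next;
}

// Parse from the chunk boundary at or before `startLine` down to the end,
// reusing the checkpoints saved by an earlier parse for everything above it.
function parse(lines, checkpoints, startLine) {
  const first = Math.floor(startLine / CHUNK_LINES) * CHUNK_LINES;
  let state = first === 0 ? { depth: 0 } : checkpoints.get(first);
  // Checkpoints below the resume point are stale and get rebuilt in the loop.
  const kept = new Map([...checkpoints].filter(([line]) => line <= first));
  let visited = 0;
  for (let i = first; i < lines.length; i++) {
    // A checkpoint stores the state *before* the first line of a chunk.
    if (i % CHUNK_LINES === 0) kept.set(i, { depth: state.depth });
    state = tokenLine(lines[i], state);
    visited++;
  }
  return { checkpoints: kept, finalState: state, visited };
}

const lines = ["a {", "b {", "c", "d }", "e }", "f"];
const full = parse(lines, new Map(), 0);        // visits all 6 lines
lines[3] = "d";                                 // edit line index 3
const partial = parse(lines, full.checkpoints, 3); // resumes at line index 2, visits 4 lines
```

In this model a breakpoint in tokenLine would first fire on the start of the chunk containing the edit, not on the edited line itself, which may be why it looks like parsing restarts "too early".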

No, you cannot access arbitrary parts of the document from a stream parser, because that would make it impossible to cache its result at all.

2048 does seem a bit on the high side. This patch reduces that to 512.

Thank you very much for the patch. When is the next patch scheduled for release?

Regarding your other point:

"you cannot access arbitrary parts of the document from a stream parser, because that would make it impossible to cache its result at all"

Unfortunately, the old language mode we used in CodeMirror v5 is entirely dependent on having access to the full text content of the code editor. It doesn’t mutate the document in any way; it simply needs to read its content.

Here is the call stack for the top few calls when the document is updated and token() is invoked:

token           @ old-language-mode.js:1027
readToken       @ @codemirror/language/dist/index.js:2464
parseLine       @ @codemirror/language/dist/index.js:2428
advance         @ @codemirror/language/dist/index.js:2345
(anonymous)     @ @codemirror/language/dist/index.js:363

The bottom-most line provides access to the document’s text content. If we could pass this.state.doc.toString() all the way to the token() function, our issue would be resolved. I have already tested this by manually modifying the @codemirror/language/dist/index.js file in node_modules and it doesn’t appear to affect the caching of results.

Note: I am new to using CodeMirror, so please forgive me if this is a naive or stupid question :)

You can smuggle access to the current document into your token function in your own code, if you really want to. But again, this goes against the idea that the only inputs to the tokenizer are the current line and the tokenizer state, which is required to be able to cache these, so the library isn’t going to lead people in the wrong direction by passing in the whole document.
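As an illustration of the "only the current line and the tokenizer state" contract, here is a minimal sketch of a mode that carries context forward in its state instead of reading other lines. LineStream, blockMode, and tokenizeLine are hypothetical names made up for this example; LineStream is a tiny stand-in that implements only the methods used here and is not the library's stream class. Whether OldLangMode can be restructured this way is an open question.

```javascript
// Hypothetical minimal stand-in for the stream object; only the methods used below.
class LineStream {
  constructor(string) { this.string = string; this.pos = 0; }
  eol() { return this.pos >= this.string.length; }
  eatSpace() {
    const start = this.pos;
    while (/\s/.test(this.string.charAt(this.pos))) this.pos++;
    return this.pos > start;
  }
  match(regex) {
    const m = regex.exec(this.string.slice(this.pos));
    if (!m || m.index !== 0) return null;
    this.pos += m[0].length;
    return m;
  }
}

// A v5-style mode object: everything it needs from earlier lines lives in `state`.
const blockMode = {
  startState() { return { inBlock: false }; },
  copyState(state) { return { inBlock: state.inBlock }; },
  token(stream, state) {
    if (stream.eatSpace()) return null;
    if (stream.match(/^begin\b/)) { state.inBlock = true; return "keyword"; }
    if (stream.match(/^end\b/)) { state.inBlock = false; return "keyword"; }
    stream.match(/^\S+/); // consume one word so the stream always advances
    return state.inBlock ? "variableName" : "comment";
  },
};

// Hypothetical driver: tokenize a single line, mutating `state` as a mode would.
function tokenizeLine(mode, line, state) {
  const stream = new LineStream(line);
  const styles = [];
  while (!stream.eol()) {
    const style = mode.token(stream, state);
    if (style) styles.push(style);
  }
  return styles;
}

const state = blockMode.startState();
const first = tokenizeLine(blockMode, "begin", state);  // ["keyword"]
const saved = blockMode.copyState(state);               // state after line 1
const second = tokenizeLine(blockMode, "x y", state);   // ["variableName", "variableName"]
// Replaying line 2 from the saved state gives the same result, which is what makes caching safe.
const replay = tokenizeLine(blockMode, "x y", blockMode.copyState(saved));
```

Because the output for a line is a function of (line, state) alone, a saved state can be reused after an edit elsewhere in the document. A token function that also read other lines from a side channel would break that guarantee, since a change to those lines would not invalidate the cached result.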

I’ve tagged @codemirror/language 6.11.3

What is the recommended approach to do look ahead in this context? I need to get the content of a given line number.

That’s just not something the stream parser supports.