Efficient way to get current syntax tree to extract headers

Roberto · February 3, 2022, 4:11pm

I implemented a function to extract all Markdown headers from a given state like this:

let tree = parser.parse(view.state.doc.toString())

let headerTypes = {
    'SetextHeading1'    : 1,
    'SetextHeading2'    : 2,
    'ATXHeading1'       : 1,
    'ATXHeading2'       : 2,
    'ATXHeading3'       : 3,
}

let headers = []

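tree.iterate({
    // Called once for every node in the tree, in document order
    enter: (type, from, to) => {
        // Direct lookup; avoids building the key array for every node
        let level = headerTypes[type.name]
        if (level !== undefined) {
            headers.push({
                position: from,
                level: level,
                // Raw source of the heading, still including the # marks
                value: view.state.doc.sliceString(from, to)
            })
        }
    }
})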

This works as expected. However, a couple of things seem inefficient to me:

1. parser.parse(view.state.doc.toString()) creates a string and a new tree on each state update. Is there a way to access the syntax tree directly from the state without having to generate it each time?
2. view.state.doc.sliceString(from, to) needs to be done every time to get the header content. No better way there, right?
3. The resulting string of view.state.doc.sliceString(from, to) includes the header marks #. So I need to filter that string. Any way to fetch only the header content in the first place?

marijn · February 3, 2022, 4:58pm

1. Yes.

2. Not easily, but this should be cheap enough unless you have a huge amount of headings. You could try to make it incremental and only re-query the headers that changed (by observing transactions), but that is probably overkill.

3. The syntax tree will have mark nodes for the markup syntax, which you could find and remove, but there’s no special node for just the text.
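A minimal sketch of the "find and remove the mark nodes" idea, kept independent of the editor so it can be run on its own. The helper name stripMarks is hypothetical. It assumes you have already collected the mark ranges as {from, to} pairs in document coordinates, for example from the children of the heading node (as far as I know @lezer/markdown calls these HeaderMark, but treat that name as an assumption and check your parser's node types).

```javascript
// Hypothetical helper: remove mark ranges from a heading's source text.
// `text`  is the heading source, e.g. doc.sliceString(from, to)
// `from`  is the document offset where `text` starts
// `marks` are {from, to} ranges in document coordinates
function stripMarks(text, from, marks) {
  // Work on a sorted copy so the caller's array is left untouched
  const sorted = [...marks].sort((a, b) => a.from - b.from)
  let result = ""
  let pos = from
  for (const mark of sorted) {
    // Keep the text between the previous mark and this one
    result += text.slice(pos - from, mark.from - from)
    pos = Math.max(pos, mark.to)
  }
  // Keep whatever follows the last mark
  result += text.slice(pos - from)
  return result.trim()
}
```

For an ATX heading "## Hello ##" starting at offset 6, the marks would cover offsets 6 to 8 and 15 to 17, and the helper returns just the text between them. A setext heading works the same way, with the underline as the only mark.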

Roberto · February 3, 2022, 5:07pm

Thanks for your quick reply!

As I read in the docs, language.syntaxTree() may return an incomplete tree. Even if I waited for language.syntaxTreeAvailable(), this wouldn’t guarantee a complete tree, right? I’m interested in all headlines of the document, inside or outside the viewport.

So I guess I would have to call language.ensureSyntaxTree() with upto == state.doc.length. But that would trigger a new parsing run, wouldn’t it? Or am I misinterpreting the docs here?

marijn · February 3, 2022, 5:50pm

ensureSyntaxTree will use the existing tree insofar as it is available.
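A rough sketch of how that could be wrapped, assuming the documented behavior that ensureSyntaxTree(state, upto, timeout) returns null when it cannot finish within the timeout, while syntaxTree(state) always returns whatever tree is currently available. The wrapper name treeForWholeDoc and the `api` parameter are hypothetical; the two functions are passed in so the snippet runs without the editor packages, whereas in real code you would import them from @codemirror/language.

```javascript
// Hypothetical wrapper: ask for a tree covering the whole document and
// fall back to the (possibly partial) current tree if parsing times out.
// `api` is expected to provide ensureSyntaxTree and syntaxTree.
function treeForWholeDoc(state, api, timeoutMs = 50) {
  const full = api.ensureSyntaxTree(state, state.doc.length, timeoutMs)
  if (full) return { tree: full, complete: true }
  // Timed out: report that the tree may not reach the end of the document
  return { tree: api.syntaxTree(state), complete: false }
}
```

The `complete` flag lets the caller decide whether to show a partial outline now and refresh it on a later update.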

Roberto · February 3, 2022, 6:11pm

Awesome! Thanks. Will switch to that then.

Roberto · February 7, 2022, 3:41pm

I modified my solution to use ensureSyntaxTree instead:

let tree = ensureSyntaxTree(editorState, editorState.doc.length, 5000)

However, this turned out to be much slower than my previous version in documents with lots of headlines. It takes about ten times as long: 400 ms vs. 40 ms.

So I’m going back to reparsing the document string for the time being.

marijn · February 7, 2022, 4:31pm

That’s interesting. Can you distill that down to a snippet I can test? Because it suggests something is going wrong.
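One possible starting point for such a snippet is a small timing harness plus a generated document with many headings. This is only a sketch: makeDoc and timeRuns are hypothetical helpers, and the two code paths to compare (reparsing the string versus calling ensureSyntaxTree) would have to be passed in as functions. It relies on the global performance.now(), which is available in browsers and recent Node versions.

```javascript
// Hypothetical helper: build a Markdown document with `count` headings.
function makeDoc(count) {
  const parts = []
  for (let i = 1; i <= count; i++) {
    parts.push(`## Heading ${i}\n\nSome text.\n`)
  }
  return parts.join("\n")
}

// Hypothetical helper: run `fn` several times and report timings in ms.
function timeRuns(fn, runs = 5) {
  const samples = []
  for (let i = 0; i < runs; i++) {
    const start = performance.now()
    fn()
    samples.push(performance.now() - start)
  }
  samples.sort((a, b) => a - b)
  return {
    min: samples[0],
    median: samples[Math.floor(samples.length / 2)],
    max: samples[samples.length - 1]
  }
}
```

Usage would be along the lines of timeRuns(() => extractWithReparse(doc)) versus timeRuns(() => extractWithEnsure(state)), where both extract functions are placeholders for the two versions discussed above.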