Mapping Tree-Sitter trees to Lezer trees

I’m trying to use codemirror.next for Python. It seems like writing the Python grammar in Lezer is not currently support, so in the meantime I was planning on using tree-sitter-python as a fill-in. I’m implementing a Syntax service that translates from a tree-sitter Tree into a Lezer Tree. My immediate goal is to get Python syntax highlighting working.

It’s easy enough to derive a NodeType for each tree-sitter node, but I’m having trouble creating a Lezer Tree. Right now, I’m fine with something’s that inefficient since my use case is primarily read-only. I was trying to create a Tree for each node, but that doesn’t work because Tree only works at the toplevel (start and end are fixed to 0 and length). Presumably the inner nodes have to be NodeSubtree.

But there I’m also in a bind. NodeSubtree expects a parent Subtree to its constructor. But the parent Tree expects. a list of children to its constructor. So it’s not clear how to satisfy both dependencies.

What is the recommended way to do this?

No, that’s not how that works—each individual Tree instance only knows its length, but their parent will know their position. This is necessary to allow trees to be reused across different parents, where they might sit at a different position.

You don’t need to create subtree instances, if you’re just trying to model a tree. It might also be workable to look at the fromBuffer interface—if you can iterate the original tree-sitter tree, it might be quite easy to emit an array of numbers that you can convert to a Lezer tree.

A Lezer Python grammar is relatively high on my list of priorities, but I can’t promise any concrete timeline yet.

1 Like

Thanks for the clarifications. By fromBuffer do you mean Tree.build? Do you have any documentation for the format of the buffer? The build docs just mention " a postfix-ordered buffer of node information" but I’m not sure what exactly that means.

Oops, yeah, I did mean Tree.build, and I assumed I had documented the input format, which I apparently haven’t. Each node is represented as 4 numbers, with child nodes coming before parent nodes. The numbers are, in order, the node’s type id, its start position, its end position, and its size—including children—in the array (so that is (1 + total child count) * 4, where the total child count includes also the children of the node’s direct children).

So to encode a node of type 2 with a child of type 3, both of which span position 0 to 1, you’d need an array like

[3, 0, 1, 4, 2, 0, 1, 8]

Perfect, got it working. Here’s the gist I came up with in case anyone else finds it useful.

Thanks @marijn!

Works perfectly for me. However, I noticed there is a subtle quirk: a node type with id 0 seems always reserved for the top node. For example, suppose there is a LezerTree instance:

import {parser as parserExpr} from './parsers/expr'
const tree = parserExpr.parse('(3+4)*5')

with grammar

@precedence { times left, plus left }

@top Expression { expression }

expression {
  Number |
  BinaryExpression
}

BinaryExpression {
  expression !times "*" expression |
  expression !plus "+" expression |
  "(" expression ")"
}

@tokens {
  Number { std.digit+ }
}

The value tree.type always equals to the type with id 0 (Expression in this case), even when that type is not the one of the top node. A quick fix is to explicitly include it before collecting node types:

build() {
  const rootNode = this.tree.rootNode;
  this.node_types[rootNode.type] = new _LezerTree.NodeType(rootNode.type, {}, 0);
    
  const buffer = this.build_node(this.tree.rootNode);
  /* ... */
}

This exists now: GitHub - lezer-parser/python: A Python parser for Lezer

1 Like