Help wanted on understanding how it applies to syntax highlighting

qbane · January 22, 2020, 2:34pm

I created a small repository to learn how Lezer works, but I feel there is a large gap between parsing a document and putting the result to real use. So I came here to ask for clarification.

The repo (basically a copy from the system guide): https://github.com/andy0130tw/try-lezer

I know I can invoke iterate on the root tree, which implements the SubTree interface. The following snippet produces the verbose output shown in the comment:

import {parser as parserExpr} from './parsers/expr'
const nodeTypeRepr = type => `${type.id}:${type.name}`

const tree = parserExpr.parse('(8+9)*17+20*20')

tree.iterate({
  // from: 0,
  // to: tree.length,
  enter: (type, start, end) => {
    console.log('enter', nodeTypeRepr(type), start, end)
  },
  leave: (type, start, end) => {
    console.log('leave', nodeTypeRepr(type), start, end)
  },
})

/* Output:
enter 4:BinaryExpression 0 14
enter 4:BinaryExpression 0 8
enter 4:BinaryExpression 0 5
enter 4:BinaryExpression 1 4
enter 3:Number 1 2
leave 3:Number 1 2
enter 3:Number 3 4
leave 3:Number 3 4
leave 4:BinaryExpression 1 4
leave 4:BinaryExpression 0 5
enter 3:Number 6 8
leave 3:Number 6 8
leave 4:BinaryExpression 0 8
enter 4:BinaryExpression 9 14
enter 3:Number 9 11
leave 3:Number 9 11
enter 3:Number 12 14
leave 3:Number 12 14
leave 4:BinaryExpression 9 14
leave 4:BinaryExpression 0 14
leave 4:BinaryExpression 0 14
*/

My question is: how can I make use of this to do syntax highlighting, pretty printing, etc.? If I were to tackle this myself, my approach would be to focus on the leaf nodes only, for instance by producing a list of (type, position) pairs. But it turns out that no such functionality is provided. I wonder which part of my thinking is wrong. Thanks!

marijn · January 22, 2020, 2:52pm

You can see this code that does highlighting based on a tree. Basically it just emits spans based on the innermost styled node that covers a given range of code (though there’s also a bunch of subtle stuff going on related to style matching and inherited styles).
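
As a rough illustration of the "innermost styled node" idea (not the actual highlighter code), here is a minimal sketch. It assumes a hand-written plain-object tree for `(8+9)*17` instead of a real Lezer tree, and the `styles` table, the `tok-*` class names, and the `highlightSpans` function are all hypothetical names made up for this example.

```javascript
// Hypothetical style table: node name -> CSS class. Names without an entry are unstyled.
const styles = new Map([
  ['Number', 'tok-number'],
  ['BinaryExpression', 'tok-expression'],
])

// Simplified, hand-written stand-in for a parsed tree of "(8+9)*17",
// so the sketch runs without Lezer. This is not Lezer's real data structure.
const tree = {
  name: 'BinaryExpression', start: 0, end: 8, children: [
    { name: 'BinaryExpression', start: 1, end: 4, children: [
      { name: 'Number', start: 1, end: 2, children: [] },
      { name: 'Number', start: 3, end: 4, children: [] },
    ] },
    { name: 'Number', start: 6, end: 8, children: [] },
  ],
}

// Walk the tree depth-first and emit flat {from, to, style} spans.
// Text not covered by a child gets this node's style, or the style
// inherited from the nearest styled ancestor if this node has none.
function highlightSpans(node, inherited = null, out = []) {
  const style = styles.get(node.name) || inherited
  let pos = node.start
  const emit = (from, to) => {
    if (from >= to || style === null) return
    const last = out[out.length - 1]
    // Merge with the previous span when it is adjacent and has the same style.
    if (last && last.to === from && last.style === style) last.to = to
    else out.push({ from, to, style })
  }
  for (const child of node.children) {
    emit(pos, child.start)            // gap before the child belongs to this node
    highlightSpans(child, style, out) // the child may override the style
    pos = child.end
  }
  emit(pos, node.end)                 // trailing gap after the last child
  return out
}

const spans = highlightSpans(tree)
console.log(spans)
```

Every character ends up in at most one span, so the rendered output stays flat no matter how deep the tree is.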

Using this for pretty-printing is a whole different thing, and I haven’t really thought about that.

nikku · January 26, 2020, 8:39pm

I’ve implemented the approach @marijn mentions in a small app and it works perfectly fine.

First I iterate over a lezer tree and collect all tokens that are worth syntax highlighting. Each token is a {start, end, tokenType} element (cf. here).

For each token worth highlighting I insert a span with mark-{tokenType} as its class name. In my case, with CodeMirror@5, CodeMirror#markText(...) does the job (cf. here).
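
The two steps above can be sketched roughly as follows. To keep it runnable without Lezer, the `iterate` function below is a stand-in that mimics the `(type, start, end)` callbacks from the snippet at the top of the thread, the tree for `20*20` is written by hand, and `highlighted` and `collectTokens` are hypothetical names, not code from the app.

```javascript
// Stand-in for tree.iterate: calls enter/leave with (type, start, end)
// for every node of a plain-object tree, in document order.
function iterate(node, { enter, leave }) {
  enter(node.type, node.start, node.end)
  for (const child of node.children || []) iterate(child, { enter, leave })
  leave(node.type, node.start, node.end)
}

// Hand-written tree for "20*20" (hypothetical shape, not Lezer's real structure).
const tree = {
  type: { id: 4, name: 'BinaryExpression' }, start: 0, end: 5, children: [
    { type: { id: 3, name: 'Number' }, start: 0, end: 2 },
    { type: { id: 3, name: 'Number' }, start: 3, end: 5 },
  ],
}

// Node names considered worth highlighting (an assumption for this example).
const highlighted = new Set(['Number'])

// Step 1: collect {start, end, tokenType} for every node worth highlighting.
function collectTokens(root) {
  const tokens = []
  iterate(root, {
    enter(type, start, end) {
      if (highlighted.has(type.name)) tokens.push({ start, end, tokenType: type.name })
    },
    leave() {},
  })
  return tokens
}

const tokens = collectTokens(tree)

// Step 2: derive the class name used for each mark.
const classNames = tokens.map(t => `mark-${t.tokenType}`)
console.log(tokens, classNames)
```

With a CodeMirror 5 document `doc`, each token could then be marked with something like `doc.markText(doc.posFromIndex(t.start), doc.posFromIndex(t.end), {className: 'mark-' + t.tokenType})`; that call is left out of the block because it needs a live editor.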

The performance is pretty bad, as I’m creating hierarchies of <span/> elements. If I need to highlight the function call foo(a), the result will be something like:

<span class="mark-function">
  <span class="mark-keyword">
    foo
  </span>
  (
  <span class="mark-parameters">
    <span class="mark-name">
      a
    </span>
  </span>
  )
</span>

A possibly cheaper way is to have only one level of <span> and make nested tokens inherit their parents' marks. We essentially wrap each token in a single span and compose the classes instead:

<span class="mark-function mark-keyword">
  foo
</span>
<span class="mark-function">
  (
</span>
<span class="mark-function mark-parameters mark-name">
  a
</span>
<span class="mark-function">
  )
</span>
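
A minimal sketch of how that flattening could work, assuming the tokens are properly nested and listed in document order with outer tokens before inner ones (the order in which enter callbacks would report them). The `flatten` function and the offsets for `foo(a)` are made up for this example.

```javascript
// Nested tokens for "foo(a)", outer tokens first (hypothetical offsets).
const tokens = [
  { start: 0, end: 6, tokenType: 'function' },
  { start: 0, end: 3, tokenType: 'keyword' },
  { start: 4, end: 5, tokenType: 'parameters' },
  { start: 4, end: 5, tokenType: 'name' },
]

// Turn nested tokens into non-overlapping spans whose class name
// combines the marks of all tokens covering that range.
function flatten(tokens) {
  const spans = []
  const stack = [] // currently open tokens, outermost first
  let pos = 0      // everything before pos has already been emitted

  // Emit the range [pos, to) with the classes of all open tokens.
  const emit = (to) => {
    if (to > pos && stack.length > 0) {
      const className = stack.map(t => `mark-${t.tokenType}`).join(' ')
      spans.push({ start: pos, end: to, className })
    }
    pos = Math.max(pos, to)
  }

  for (const token of tokens) {
    // Close every open token that ends at or before the start of this one.
    while (stack.length > 0 && stack[stack.length - 1].end <= token.start) {
      emit(stack[stack.length - 1].end)
      stack.pop()
    }
    emit(token.start) // text between the last boundary and this token
    stack.push(token)
  }
  // Close whatever is still open at the end of the input.
  while (stack.length > 0) {
    emit(stack[stack.length - 1].end)
    stack.pop()
  }
  return spans
}

const spans = flatten(tokens)
console.log(spans)
```

For the `foo(a)` tokens this yields four spans whose class names match the flattened markup above, one each for `foo`, `(`, `a`, and `)`.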

qbane · February 1, 2020, 8:19pm

Thanks. With these concrete examples, I think I understand better now. As @nikku pointed out, it is generally not effective to use tree.iterate to re-form a tree. More effort is required than I had imagined, such as maintaining a stack and flattening nested tokens, to get the job done.