Dedent tokens in Lezer

How would you write a Lezer grammar for a language with significant indentation, like Python, CoffeeScript, or TaskPaper?

It seems treesitter grammars use an external tokenizer. Would a Lezer grammar use a similar method?

What would be the best strategy for that external tokenizer? Could it keep track of the current indentation level as state? Or could that level be computed cheaply from the inputs the tokenizer receives?

Yes, a stateful external tokenizer that tracks indentation is the usual solution for this. It might also be possible (though I haven’t looked deeply enough into this to be sure) to do most of the tokenizing in the Lezer grammar itself and just have external tokens for dedent markers or something similar.
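To make the stateful approach concrete, here is a minimal sketch of the classic Python-style algorithm such a tokenizer would implement. This is plain JavaScript, deliberately independent of Lezer's actual `ExternalTokenizer` API: the point is only to show the state (a stack of indentation levels) and how `indent`/`dedent` tokens fall out of it.

```javascript
// Sketch of Python-style indent/dedent tokenization over whole lines.
// The single piece of state is a stack of indentation columns.
// Error handling for inconsistent dedents is omitted for brevity.
function indentTokens(source) {
  const stack = [0];   // indentation levels currently open, innermost last
  const tokens = [];
  for (const line of source.split("\n")) {
    if (!line.trim()) continue;             // blank lines don't change indentation
    const indent = line.match(/^ */)[0].length;
    if (indent > stack[stack.length - 1]) { // deeper: open one new level
      stack.push(indent);
      tokens.push("indent");
    } else {
      // Shallower: emit one dedent per level popped, so nesting stays balanced.
      while (indent < stack[stack.length - 1]) {
        stack.pop();
        tokens.push("dedent");
      }
    }
    tokens.push("line");
  }
  // Close any levels still open at end of input.
  while (stack.length > 1) { stack.pop(); tokens.push("dedent"); }
  return tokens;
}
```

In a real Lezer tokenizer the same stack would live in the tokenizer's (or, per the discussion below, the parse's) state, and the tokens would be emitted at line boundaries instead of being collected into an array.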

How does a stateful external tokenizer interact with Lezer when the parse splits at an ambiguity? And can that state become invalid between incremental parses?

I checked what treesitter does: it serializes and deserializes external tokenizers, so that each branch of an ambiguous parse has an external tokenizer with its own state. The serialized tokenizer state is stored in the syntax tree. I’m not sure how I would do this in Lezer. Is there a way to access the Lezer syntax tree from a tokenizer?
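For an indentation stack, that serialized state can be tiny. Here is an illustrative sketch (not treesitter's actual C scanner interface) of round-tripping the stack through a byte buffer, the way a serialized scanner state would be stored alongside the tree:

```javascript
// Sketch: serialize an indentation-level stack to a byte buffer and back.
// Assumes indentation columns fit in a byte, which real scanners would
// have to handle more carefully.
function serializeLevels(levels) {
  return Uint8Array.from(levels);
}

function deserializeLevels(buffer) {
  return Array.from(buffer);
}
```

Because the state is just this small buffer, copying it per ambiguous branch and storing it with reusable tree nodes is cheap.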

I haven’t yet figured out how treesitter handles incremental parsing with regard to the state of external tokenizers. Perhaps, since the tokenizer state is stored in the syntax tree, whenever a part of the tree can be reused, the tokenizer state in that part can be reused as well.

Oh, that’s a good question—right now, it has no real support for this. So that’d have to be added first. I’d lean towards allowing a grammar (not a tokenizer) to define a state, which is notified of the various state transitions and copied when a parse stack is split.
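The design described here, a grammar-level state that is copied whenever the parse stack splits, can be sketched as an immutable context object. All names below are illustrative, not Lezer's real API; the point is that immutability makes the copy-on-split essentially free:

```javascript
// Sketch of a grammar-level parse context holding indentation state.
// Methods return new contexts instead of mutating, so when the parser
// splits at an ambiguity, each branch simply keeps its own reference.
class IndentContext {
  constructor(levels = [0]) {
    this.levels = levels;   // open indentation columns, innermost last
  }
  get current() {
    return this.levels[this.levels.length - 1];
  }
  indent(to) {
    return new IndentContext([...this.levels, to]);
  }
  dedent() {
    return new IndentContext(this.levels.slice(0, -1));
  }
  // "Copying" on a stack split is just sharing the immutable value.
  split() {
    return this;
  }
}
```

The parser would notify this context of shifts and reductions that open or close an indentation level, and each stack branch would carry its own (shared, immutable) instance.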

It turns out that stack inspection can be used to implement Python-style significant indentation. There’s now a working Python grammar at https://github.com/lezer-parser/python/