Merging adjacent tokens makes completion much harder when it merges adjacent bracket tokens

stof · March 7, 2025, 10:41am

The commit "Merge adjacent tokens created by stream parsers" (codemirror/language@396019f on GitHub) introduced merging of adjacent tokens of the same type.
This broke our completion logic, which supports completing the arguments of some special functions in our language. Successive brackets are now merged into a single token, which makes it much harder to detect when completion is triggered just after the opening bracket of a (relevant) function call.
My language grammar intentionally matches brackets one by one to create bracket tokens, so this merging is totally undesirable in my case (other tokens are already parsed greedily and won’t produce such adjacent tokens). Would it be possible to make this merging behavior configurable per language?
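To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use CodeMirror's real data structures: the `Token` shape, the toy `tokenize` function, and the `isJustAfterCallOpen` check are all hypothetical stand-ins for a stream tokenizer and for completion logic that relies on token boundaries. It assumes the editor auto-closes brackets, so the text is `path()` with the cursor between the two brackets.

```typescript
// Hypothetical token shape for illustration only (not CodeMirror's internals).
interface Token {
  type: string;
  from: number; // inclusive start offset
  to: number;   // exclusive end offset
}

// Toy tokenizer: every bracket is its own token, identifiers are greedy,
// anything else is skipped.
function tokenize(text: string): Token[] {
  const tokens: Token[] = [];
  let pos = 0;
  while (pos < text.length) {
    const ch = text[pos];
    if (ch === "(" || ch === ")") {
      tokens.push({ type: "bracket", from: pos, to: pos + 1 });
      pos += 1;
    } else if (/[A-Za-z_]/.test(ch)) {
      let end = pos + 1;
      while (end < text.length && /\w/.test(text[end])) end++;
      tokens.push({ type: "variable", from: pos, to: end });
      pos = end;
    } else {
      pos += 1;
    }
  }
  return tokens;
}

// Merge touching tokens of the same type, as described in the post above.
function mergeAdjacent(tokens: Token[]): Token[] {
  const merged: Token[] = [];
  for (const tok of tokens) {
    const last = merged[merged.length - 1];
    if (last && last.type === tok.type && last.to === tok.from) {
      last.to = tok.to; // extend the previous token instead of adding a new one
    } else {
      merged.push({ ...tok }); // copy so the input array is not mutated
    }
  }
  return merged;
}

// Boundary-based check: does a "(" token end exactly at the cursor,
// directly preceded by a variable token named `fn`?
function isJustAfterCallOpen(text: string, tokens: Token[], cursor: number, fn: string): boolean {
  const i = tokens.findIndex((t) => t.to === cursor);
  if (i < 1) return false;
  const open = tokens[i];
  const callee = tokens[i - 1];
  return (
    open.type === "bracket" &&
    text.slice(open.from, open.to) === "(" &&
    callee.type === "variable" &&
    callee.to === open.from &&
    text.slice(callee.from, callee.to) === fn
  );
}

const text = "path()";
const raw = tokenize(text);       // path, "(", ")"
const merged = mergeAdjacent(raw); // path, "()"
console.log(isJustAfterCallOpen(text, raw, 5, "path"));
console.log(isJustAfterCallOpen(text, merged, 5, "path"));
```

With one token per bracket, the cursor at offset 5 sits exactly at the end of the `(` token. After merging, `()` is a single token spanning offsets 4 to 6, so no token boundary falls at the cursor and the check no longer fires.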

In case this is relevant, my language parser is defined using simple-mode.

marijn · March 11, 2025, 8:03am

Many of the old tokenizers in legacy-modes rely on this kind of merging, because they’ll return some tokens in smaller pieces. Is there any chance you’d be able to switch to a Lezer-based parser for your system? Those are generally superior to stream parsers when it comes to things like semantic completion, because they produce much more structured output.

stof · March 11, 2025, 5:52pm

Using a Lezer-based parser won’t be easy for my system, because it would force me either to add an additional build step or to find a way to integrate the Lezer build into a webpack loader so that it fits in my existing build. Are you aware of an existing webpack loader for Lezer grammars?

The simple-mode from legacy-modes has worked well for me for years to support my grammar (which is quite simple). I was already using it in CM5.
Merging tokens is quite new: it was added 4 months ago, whereas I have been using simple mode without merging for 9 years, and legacy modes were also used in the ecosystem before merging was introduced. It would still be great to be able to control this merging (either at the language level by disabling it entirely, or by marking some token types as unmergeable).
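The two kinds of control suggested here could look roughly like the following sketch. Everything in it is hypothetical and only illustrates the request: the `MergeOptions` interface, its `mergeTokens` and `unmergeable` fields, and the `mergeAdjacent` function are made-up names, not an existing CodeMirror API.

```typescript
// Hypothetical token shape for illustration only.
interface Token {
  type: string;
  from: number; // inclusive start offset
  to: number;   // exclusive end offset
}

// Hypothetical options mirroring the two controls proposed above.
interface MergeOptions {
  mergeTokens?: boolean;             // false disables merging entirely (default: true)
  unmergeable?: ReadonlySet<string>; // token types that must never be merged
}

function mergeAdjacent(tokens: readonly Token[], options: MergeOptions = {}): Token[] {
  const { mergeTokens = true, unmergeable = new Set<string>() } = options;
  const out: Token[] = [];
  for (const tok of tokens) {
    const last = out[out.length - 1];
    if (
      mergeTokens &&
      last &&
      last.type === tok.type &&
      last.to === tok.from &&
      !unmergeable.has(tok.type)
    ) {
      last.to = tok.to; // extend the previous token
    } else {
      out.push({ ...tok }); // keep the token separate (copied, input untouched)
    }
  }
  return out;
}

// Two touching brackets followed by an identifier returned in two pieces.
const pieces: Token[] = [
  { type: "bracket", from: 0, to: 1 },
  { type: "bracket", from: 1, to: 2 },
  { type: "variable", from: 2, to: 3 },
  { type: "variable", from: 3, to: 5 },
];

console.log(mergeAdjacent(pieces).length);
console.log(mergeAdjacent(pieces, { mergeTokens: false }).length);
console.log(mergeAdjacent(pieces, { unmergeable: new Set(["bracket"]) }).length);
```

The per-type variant keeps brackets separate while still merging tokenizers' smaller pieces of other types; the language-level flag is simpler and is enough when, as in my grammar, no other token type produces adjacent pieces.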

marijn · March 12, 2025, 11:49am

Do these two patches look like they would help for you?

stof · March 12, 2025, 12:10pm

It looks like you linked to the same patch twice. I guess you meant "Allow simple modes to pass a mergeTokens option" (codemirror/legacy-modes@e75bbe2 on GitHub) for the second patch.

Those patches look like they could solve my use case indeed.

marijn · March 13, 2025, 2:01pm

Great. I’ve tagged releases including those patches.