Word tokenization for CJK languages

Typically, double-clicking in a text editor should select the nearest word. In English and similar languages this is straightforward, since the delimiters are pretty clear.

However, CJK languages don’t have word delimiters. Word tokenizers were invented for this, and macOS supports one natively. For example, take a sentence like:

我能吞下玻璃而不伤身体

If I then double-click somewhere between 玻 and 璃, 玻璃 should be selected because it’s a word.

What drives me nuts is that contenteditable in WebKit on macOS does support this, but it’s missing in CodeMirror: when I double-click as in the example above, the entire sentence is selected.

Well, since I am working on a native macOS app (MarkEdit), I can leverage the native NLTokenizer (part of the macOS NaturalLanguage framework) to achieve the goal. I have a PR here: Word tokenization for CJK languages by cyanzhong · Pull Request #77 · MarkEdit-app/MarkEdit · GitHub

However, I think this approach shouldn’t be necessary, and I might have missed something obvious. Do you have any insights?

Thanks in advance!

Hello @marijn, do you have a quick answer to why this is disabled in CodeMirror? Thanks!

CodeMirror has its own conception of a ‘group’ (a sequence of either whitespace characters, word characters, or non-whitespace non-word characters) that is selected when you double-click (or move with ctrl-arrow keys). These can be influenced with language-specific data (so that for example ‘$’ is considered a word character in PHP for easy selection/skipping of variable names).
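
To illustrate the grouping described above, here is a minimal sketch in TypeScript. It is not CodeMirror's actual implementation: `categorize` and `groupAt` are hypothetical names, and the word-character test (Unicode letters, digits, underscore) is an assumption. The `extraWordChars` parameter stands in for the language-specific data mentioned above.

```typescript
// Simplified model of "groups": runs of whitespace, word characters,
// or other characters. An approximation for illustration only.
type Category = "word" | "space" | "other";

// Assumption: a word character is a Unicode letter, a digit, or "_".
const WORD_CHAR = new RegExp("[\\p{L}\\p{N}_]", "u");

// Classify a single UTF-16 code unit. Characters outside the BMP are
// not handled properly here; a real implementation would work on
// grapheme clusters instead.
function categorize(ch: string, extraWordChars: string = ""): Category {
  if (/\s/.test(ch)) return "space";
  if (WORD_CHAR.test(ch) || extraWordChars.indexOf(ch) >= 0) return "word";
  return "other";
}

// Expand from `offset` in both directions while the category stays the same.
function groupAt(
  text: string,
  offset: number,
  extraWordChars: string = ""
): { from: number; to: number } | null {
  if (offset < 0 || offset >= text.length) return null;
  const cat = categorize(text[offset], extraWordChars);
  let from = offset;
  let to = offset + 1;
  while (from > 0 && categorize(text[from - 1], extraWordChars) === cat) from--;
  while (to < text.length && categorize(text[to], extraWordChars) === cat) to++;
  return { from, to };
}
```

With this model a run of CJK characters has no internal boundaries, so `groupAt("我能吞下玻璃而不伤身体", 5)` covers the whole sentence, which matches the behavior reported in this thread. Passing `"$"` as `extraWordChars` makes `$name` a single group, similar to the PHP example.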

VS Code seems to do something similar—there, clicking a string of CJK characters selects the entire group. My Firefox (Linux) also seems to do this by default, though Chrome uses proper segmenting.

Given that we also group together obvious multi-word Latin strings like getElementById for these purposes, and the current behavior aligns with VS Code, I am not sure I want to change this, at least not by default.

JavaScript does provide an API for by-word segmenting that could be used to implement something like this, if we want to explore adding an optional custom behavior.
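
This presumably refers to `Intl.Segmenter` with `granularity: "word"`. Below is a rough sketch of finding the word under a click offset, assuming the runtime ships `Intl.Segmenter` and its data covers the language in question. `wordRangeAt` is a hypothetical helper, not part of CodeMirror.

```typescript
// Fields of the segment data object returned by Segments.containing().
interface WordSegment {
  segment: string;      // the text of the segment
  index: number;        // start offset of the segment in the input
  isWordLike?: boolean; // false for whitespace and punctuation
}

interface WordRange {
  from: number;
  to: number;
  text: string;
}

// Hypothetical helper: returns the word-like segment that contains
// `offset`, or null when the offset is out of range or sits on
// whitespace or punctuation.
function wordRangeAt(text: string, offset: number, locale?: string): WordRange | null {
  if (offset < 0 || offset >= text.length) return null;
  // The cast keeps this compiling with TypeScript `lib` settings that
  // lack Intl.Segmenter typings; drop it if your config includes them.
  const segmenter = new (Intl as any).Segmenter(locale, { granularity: "word" });
  const seg: WordSegment | undefined = segmenter.segment(text).containing(offset);
  if (!seg || !seg.isWordLike) return null;
  return { from: seg.index, to: seg.index + seg.segment.length, text: seg.segment };
}
```

The exact boundaries depend on the engine's dictionary data, so whether an offset inside 玻璃 yields exactly that word should be checked in the target runtime.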

Would you expect ctrl-rightArrow (or cmd-rightArrow) to also move by segmented word? Always, or only in certain contexts?

Thank you for the reply, it is very kind of you, Marijn.

Would you expect ctrl-rightArrow (or cmd-rightArrow) to also move by segmented word? Always, or only in certain contexts?

This is very interesting because when I implemented my own tokenization, I totally forgot about this behavior. And yes, this is something native macOS editors have (I might add it to my editor later).

I agree with the part that takes VS Code as the example. It is fair to say that for a source code editor this behavior is unnecessary (most people write code only in English). I am talking more about editors for writing Markdown and the like.

JavaScript does provide an API for by-word segmenting that could be used to implement something like this, if we want to explore adding an optional custom behavior.

This is also a very valuable point. I wasn’t aware that there was a JavaScript API for that, thanks for pointing it out. I think it would be helpful for pure web editors like Obsidian, which doesn’t use native APIs.

For my case, I would also like to clarify a little: I am not asking for new features to support this. I was just curious about why the built-in tokenization on macOS is not available, and whether there’s a trivial way to add it back. Maybe that’s because I don’t understand CodeMirror’s own conception of a ‘group’ well enough.

Thanks for the great write-up!