Syntax highlighting for non-English words

We are trying to get CodeMirror (from Jupyter Notebook) to highlight non-English words, in an attempt to create an environment for a non-English programming language. For now, I started out by just editing the language definitions for Python in codemirror/mode/python/python.js (just to get my feet wet).

For some reason, when I add something like “eee” and “eeé” to one of the variables like commonKeywords (which contains a list of Python keywords), “eee” ends up being bolded/colored while “eeé” does not. Any character outside plain English seems to break the highlighting.

Do you have any suspicions as to why that is?

The Python mode’s wordRegexp helper function terminates the resulting regexp with a \b (word boundary marker). JavaScript regexps are… pretty dumb when it comes to Unicode, and will only consider ASCII letters, digits, and the underscore as word characters. Thus é counts as a non-word character, and the position between é and the character after it (typically a space or punctuation, which is also a non-word character) will not count as a word boundary, causing the regexp to not match.
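You can see the effect in isolation with a small sketch like the one below. The wordRegexp here is a simplified stand-in written for illustration, not a copy of the helper in python.js, and the sample strings are made up. The é is written as \u00e9 so the example does not depend on how the file is encoded.

```javascript
// Simplified stand-in for the mode's helper (an assumption, not the real source):
// it builds one alternation and terminates it with \b.
function wordRegexp(words) {
  return new RegExp("^((" + words.join(")|(") + "))\\b");
}

// "ee\u00e9" is "eeé" with a precomposed é.
const keywordRegexp = wordRegexp(["eee", "ee\u00e9"]);

// "e" is a word character and " " is not, so \b matches after "eee".
const plainMatches = keywordRegexp.test("eee = 1");

// "é" and " " are both non-word characters, so there is no boundary
// between them and \b fails.
const accentedMatches = keywordRegexp.test("ee\u00e9 = 1");

// Side effect of the same rule: "é" followed by an ASCII letter IS a
// boundary, so the keyword matches as a prefix of a longer identifier.
const prefixMatches = keywordRegexp.test("ee\u00e9x = 1");

console.log(plainMatches, accentedMatches, prefixMatches);
```

The last case shows that the \b check is not just too strict for non-ASCII keywords, it is also wrong in the other direction.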

But if you’re writing your own mode, you don’t have to use that regexp approach. A better way is probably to match an entire identifier, including the non-ASCII identifier characters you want to support, and then look the resulting word up in a dictionary of keywords.
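A minimal sketch of that idea, kept independent of CodeMirror so it can be run on its own. KEYWORDS, IDENTIFIER, and classifyWord are hypothetical names chosen for this example, and the identifier pattern assumes a JavaScript engine that supports the u flag with Unicode property escapes (\p{L}, \p{N}); you may want a narrower or wider character set for your language.

```javascript
// Hypothetical keyword table; "ee\u00e9" ("eeé") stands in for a
// non-English keyword.
const KEYWORDS = new Set(["eee", "ee\u00e9"]);

// An identifier: a letter or underscore, followed by any number of
// letters, digits, or underscores, in any script.
const IDENTIFIER = /^[\p{L}_][\p{L}\p{N}_]*/u;

// Match a whole identifier at the start of `text`, then decide its style
// by looking the word up, instead of relying on \b.
function classifyWord(text) {
  const match = IDENTIFIER.exec(text);
  if (!match) return null;
  const word = match[0];
  return { word: word, style: KEYWORDS.has(word) ? "keyword" : "variable" };
}

console.log(classifyWord("ee\u00e9 = 1"));
console.log(classifyWord("ee\u00e9x = 1"));
```

Inside a mode’s token function, the same pattern could be passed to stream.match and the matched text looked up in the same way. Because the whole identifier is consumed first, “eeéx” is treated as an ordinary name rather than as the keyword “eeé” followed by “x”. If your language’s script uses combining marks, the second character class would probably also need \p{M}.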