Showing off: Spellchecking in CM6, without tricks

Monkatraz · June 7, 2021, 6:25am

This forum has a lot of troubleshooting articles, so I figured it would be a nice change of pace to show some of the stuff you can do with CM6.

As part of a project to create a markup editor for a wiki, I elected to add what I considered an essential feature: spellchecking. There is a hack you can do where you enable spellchecking on the editor’s DOM element - that isn’t what I’m doing here. That trick will spellcheck everything - including markup. This is faster, smarter, and fancier.

[Image: firefox_rhMT7EWEEf]

This uses the spellchecker-wasm package as the spellchecking backbone. All of the checking runs asynchronously in worker threads rather than on the main thread, which keeps it very fast and responsive.

One thing I wish I could have done is use my custom-made parser for CM6 to actually “extract” the content from the editor’s document, i.e. everything that isn’t markup. This proved impractical, as both my parser and bits of CodeMirror don’t run particularly well in a worker, and I felt that it was unacceptable to run the parser on the main thread. Of course, this could be mostly avoided if I only parsed the lines visible to the user, but that would add some complexity I’m not ready for yet.

Instead, I’m doing a goofy solution where I parse the document in the worker using Prism, an intentionally crude parser for syntax highlighting in HTML content. This is because Prism is very fast and runs quite nicely in a worker. In the future, we plan to hook up the markup language’s compiler/renderer so that it has a rendering mode where it does this “extraction” operation itself. We already have that compiler hooked up for live-preview and lint warnings using WASM. Once that’s done, this will be a fairly elegant operation.

Another really frickin’ cool thing is that those tooltips aren’t the Linter tooltips - they aren’t even hand-made. Those are Svelte components. I won’t go into too much detail, but it is perfectly possible to wrap a Svelte component in such a way that it can go into most places CodeMirror accepts a DOM element - and it works really, really well.

Anyways, that’s all. If anyone reading this is really itching for me to release this code - I don’t think that is too feasible. It’s all open source, so you can find it if you really want to, but it’s AGPL3 licensed and also dependent on some outside state to function, like translation strings.

marijn · June 7, 2021, 7:26am

Nice! What’s the download size for spellchecker-wasm + a dictionary?

Have you tried using the editor’s own syntax tree to extract the text content? (I think it shouldn’t be hard to iterate over the tree and ‘mask out’ all the markup tokens.)

Monkatraz · June 7, 2021, 7:54am

The spellchecker WASM is about 70KB. This is actually really small for WASM, so it surprised me when I checked just now. The markup language compiler is a whopping 2MB. It’s all async imported and gracefully degrades, thankfully.

The dictionary size really depends on how detailed you wanna get. The English dictionary I made has about 135K words and is 2MB.

Using the syntax tree is precisely what I meant. The way the spellchecker works is rather simplistic right now, so this is the current strategy:

[[*a href="https://example.com" style="color: red;"]]My link[[/a]]

becomes:

                                                     My link

Basically, all the markup gets replaced with whitespace. This allows for parsing the entire document at once in the spellchecker, which prevents a need to constantly communicate with the worker.
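To make the whitespace replacement concrete, here is a minimal sketch of the idea. It is not the project's actual code: the `MarkupRange` shape, the `findMarkup` and `maskMarkup` names, and the regex that treats every `[[...]]` run as markup are all assumptions for illustration (the real setup uses a Prism grammar to find the markup).

```typescript
// Hypothetical shape: a half-open [from, to) range of markup characters.
interface MarkupRange {
  from: number;
  to: number;
}

// Assumption for illustration: every `[[...]]` run on a single line is markup.
// The real grammar is more involved; this regex is only a stand-in for it.
function findMarkup(text: string): MarkupRange[] {
  const ranges: MarkupRange[] = [];
  const pattern = /\[\[.*?\]\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    ranges.push({ from: match.index, to: match.index + match[0].length });
  }
  return ranges;
}

// Replace each markup character with a space. Line breaks are kept, so every
// offset and line number in the result matches the original document.
// Expects ranges sorted by position and non-overlapping.
function maskMarkup(text: string, ranges: MarkupRange[]): string {
  let out = "";
  let pos = 0;
  for (const { from, to } of ranges) {
    out += text.slice(pos, from);
    out += text.slice(from, to).replace(/[^\n]/g, " ");
    pos = to;
  }
  return out + text.slice(pos);
}

const sample = '[[*a href="https://example.com" style="color: red;"]]My link[[/a]]';
const maskedSample = maskMarkup(sample, findMarkup(sample));
console.log(maskedSample.length === sample.length); // offsets are preserved
console.log(JSON.stringify(maskedSample.trim())); // only the prose is left
```

Because the masked text has the same length as the original, any misspelling the worker reports at a given offset can be mapped straight back onto the editor's document without extra bookkeeping.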

That said, this is obviously less efficient than crawling the syntax tree over a small region like the viewport. I think that would involve picking out the “words” on the main thread and then sending everything found to the worker. This approach could even be done incrementally, checking only the sections of the document that have changed.
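A rough sketch of what the word-collecting step might look like for one region: given text whose markup has already been blanked out, it gathers the words inside a range together with their document offsets, so only those need to be posted to the worker. The `Word` shape, the `extractWords` name, and the ASCII-only word pattern are assumptions for illustration, not anything from the project.

```typescript
// Hypothetical shape: a word plus its [from, to) offsets in the document.
interface Word {
  text: string;
  from: number;
  to: number;
}

// Collect the words found between `from` and `to` in already-masked text.
// Assumption: words are ASCII letters with optional inner apostrophes; a
// real implementation would need a Unicode-aware pattern.
// The range should start and end on line boundaries, otherwise a word that
// straddles an edge is reported only in part.
function extractWords(masked: string, from: number, to: number): Word[] {
  const words: Word[] = [];
  const pattern = /[A-Za-z]+(?:'[A-Za-z]+)*/g;
  const region = masked.slice(from, to);
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(region)) !== null) {
    const start = from + match.index; // shift back to document coordinates
    words.push({ text: match[0], from: start, to: start + match[0].length });
  }
  return words;
}
```

The same function would serve an incremental version: call it once per changed line range instead of once per viewport, and send only the resulting words to the worker.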

But for right now, the way this is done is simple and easy to implement. Ultimately, either the WASM compiler will handle this operation (which may be superior as it has such a detailed understanding of the document), or I will get an incremental approach using the syntax tree working.

Monkatraz · June 7, 2021, 8:29am

I should also say that one of the reasons I went for the Prism approach is that I had already made a simplistic grammar for the markup language in it, so it was easy to reuse. That grammar was also made for CM6, but not to parse things:

Glorious tooltips. This is where I really push that Svelte component capability.