Tokenizer: why always external?

I’m considering using Lezer in my project and have been reading its docs and sources for a while.

I noticed that even though Lezer’s grammar has its own tokenizer, all the standard/example grammars use external tokenizers, with a note like “Hand-written tokenizers for *** tokens that can’t be expressed by Lezer’s built-in tokenizer”.

What’s the problem with it? Is it just for efficiency, or is the built-in tokenizer not capable of it?
Would it be able to handle a C-style language without an external tokenizer?

In my project I would sacrifice parsing efficiency for a simpler implementation.
I already did this with PEG.js, but unfortunately, due to its design, it can’t really handle mathematical expressions well. That’s why I’m looking at other parsers.


Almost every language has some quirks or context sensitivities that are impossible to parse with just regular expressions (which is what in-grammar token declarations give you). In other cases, such as the XML parser, external tokenizers are used to improve error correction (to match as many tags as possible even when some are mismatched). So use grammar tokens whenever possible, since they are indeed easier to work with, and treat programmatic tokenizers as an escape mechanism for when that doesn’t work.
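To make that concrete, here is a sketch of what in-grammar token declarations can cover for a C-style language (the rule and token names are illustrative, not from any shipped grammar). Everything below is regular, so it needs no external tokenizer; even the classic `/* ... */` block comment is expressible, just awkwardly, via mutually recursive helper rules:

```
@top Program { statement* }

statement { Identifier | Number }

@skip { space | LineComment | BlockComment }

@tokens {
  // Plain regular-expression-style declarations:
  // character sets, repetition, optional parts
  Number { $[0-9]+ ("." $[0-9]+)? }
  Identifier { $[a-zA-Z_] $[a-zA-Z0-9_]* }
  LineComment { "//" ![\n]* }

  // Non-greedy "match until */" written out as recursive token rules
  BlockComment { "/*" blockCommentRest }
  blockCommentRest { ![*] blockCommentRest | "*" blockCommentAfterStar }
  blockCommentAfterStar { "/" | "*" blockCommentAfterStar | ![/*] blockCommentRest }

  space { $[ \t\n]+ }
}
```

`BlockComment` shows where the limit starts to show: still regular, but already clumsy to express. Anything genuinely context-sensitive (for example, JavaScript’s regex-vs-division ambiguity) is what the `@external tokens` declaration and a hand-written tokenizer are for.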