which is exactly what I want. However, I’ve only managed to do this with an ExternalTokenizer for parsing NoteContent nodes. I wonder if it’s possible to do it using only Lezer grammar?
Any pointers to how this can be done would be greatly appreciated!
For reference, I’m posting the grammar and the code for the ExternalTokenizer that I currently use below (sorry for pasting so much code).
import { ExternalTokenizer } from "@lezer/lr"
import { NoteContent } from "./parser.terms.js"

const EOF = -1
const FIRST_TOKEN_CHAR = "\n".charCodeAt(0)
const SECOND_TOKEN_CHAR = "∞".charCodeAt(0)
// A note delimiter looks like "\n∞∞∞javascript-a\n" (language name, optional "-a" suffix)
const tokenRegEx = /^\n∞∞∞(text|javascript|json|python)(-a)?\n/

export const noteContent = new ExternalTokenizer((input) => {
    let current = input.peek(0)
    let next = input.peek(1)
    if (current === EOF) {
        return
    }
    while (true) {
        // Unless the next two characters are a newline and a "∞" character, we can't be at
        // the start of a note delimiter, so we don't need to check for the rest of the token
        if (current === FIRST_TOKEN_CHAR && next === SECOND_TOKEN_CHAR) {
            // Read enough characters ahead to cover the longest possible delimiter
            let potentialLang = ""
            for (let i = 0; i < 18; i++) {
                potentialLang += String.fromCharCode(input.peek(i))
            }
            if (potentialLang.match(tokenRegEx)) {
                // A delimiter starts here; emit everything before it as NoteContent
                input.acceptToken(NoteContent)
                return
            }
        }
        if (next === EOF) {
            // End of input; emit the rest (including the current character) as NoteContent
            input.acceptToken(NoteContent, 1)
            return
        }
        current = input.advance(1)
        next = input.peek(1)
    }
})
Thanks for the reply - that’s great to hear! I’ve now read and re-read the documentation on local token groups and done some trial-and-error experimentation to get the parser to do what I want. However, I’m afraid I’m out of my depth here, and I feel like I’ve only grasped a small fraction of Lezer (and syntax parsers in general), so I haven’t been very successful.
I’ll do some more experimenting tomorrow, but any pointers on how I could use local token groups for my use case would be extremely helpful.
But for some reason, the above grammar results in a parser that doesn’t create nodes for NoteLanguage and Auto. The generated .terms.js file also doesn’t contain IDs for those nodes. Running the above grammar on the example content in my first post results in the following tree:
Could it be a bug that’s causing some of the tokens (NoteLanguage and Auto) not to show up in the syntax tree, even though their names start with uppercase letters?
> lezer-generator src/editor/lang-heynote/heynote.grammar -o src/editor/lang-heynote/parser.js
Wrote src/editor/lang-heynote/parser.js and src/editor/lang-heynote/parser.terms.js
Oh, right, I misread that. Tokens are atomic things that don’t nest. You can refer to other token names inside them, but those will just be included as part of the outer token. It looks like you want things like Auto and NoteLanguage to be tokens in the @local tokens block, and to have NoteDelimiter be a regular nonterminal rule.
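If I understand that suggestion correctly, the grammar fragment would look roughly like this (an untested sketch; the noteDelimiterMark and noteDelimiterEnd token names are made up here, and I haven’t verified the exact scoping of the @local tokens block against lezer-generator):

```
// NoteDelimiter becomes a regular nonterminal rule...
NoteDelimiter { noteDelimiterMark NoteLanguage Auto? noteDelimiterEnd }

// ...built from tokens declared in the local token group,
// with everything else falling through to NoteContent
@local tokens {
  noteDelimiterMark { "\n∞∞∞" }
  NoteLanguage { "text" | "javascript" | "json" | "python" }
  Auto { "-a" }
  noteDelimiterEnd { "\n" }
  @else NoteContent
}
```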
Hmm, if I put all those tokens directly under the @local tokens block, it fails to parse NoteContent, which comes from the @else rule. I assume that’s because parsing breaks on any of the local tokens, but I want the input to always be parsed as NoteContent unless a whole NoteDelimiter is encountered.
I see. That might be difficult to do with @local tokens, since those assume a single set of valid tokens, whereas in your case NoteContent can appear in two contexts (before and after the Auto token). You could kludge around this with separate NoteLanguage and NoteLanguageAuto tokens, but that’s not great either (they won’t be separate tokens in the tree). Possibly the original external tokenizer is the best way to do this after all.