Matching all text until a specific token is encountered (without ExternalTokenizer)

Hi!

Using an ExternalTokenizer, I’ve managed to write a grammar that turns the following content:


∞∞∞text-a
This is the content of a text Note which has the Auto flag
∞∞∞text
This is the content of another text Note that does not have the Auto flag

into this tree:

Document (0,151)
    Note (0,69)
        NoteDelimiter (0,11)
            NoteLanguage (4,8) // "text"
            Auto (8,10)        // the "-a"
        NoteContent (11,69)
    Note (69,151)
        NoteDelimiter (69,78)
            NoteLanguage (73,77)
        NoteContent (78,151)

which is exactly what I want. However, I’ve only managed to do this by using an ExternalTokenizer to parse the NoteContent nodes. I wonder if it’s possible to do it using only the Lezer grammar notation?

Any pointers to how this can be done would be greatly appreciated!

For reference, I’m posting the grammar and the code for the ExternalTokenizer that I currently use below (sorry for pasting so much code).

Best,
Jonatan

Grammar:

@top Document { Note* }

Note {
  NoteDelimiter NoteContent
}

NoteDelimiter {
    noteDelimiterEnter noteDelimiterMark NoteLanguage Auto? noteDelimiterEnter
}

@tokens {
    noteDelimiterMark { "∞∞∞" }
    NoteLanguage { "text" | "javascript" | "json" | "python" }
    Auto { "-a" }
    noteDelimiterEnter { "\n" }
}

@external tokens noteContent from "./external-tokens.js" { NoteContent }

ExternalTokenizer:

import { ExternalTokenizer } from '@lezer/lr'
import { NoteContent } from "./parser.terms.js"

const EOF = -1;
const FIRST_TOKEN_CHAR = "\n".charCodeAt(0)
const SECOND_TOKEN_CHAR = "∞".charCodeAt(0)

const tokenRegEx = /^\n∞∞∞(text|javascript|json|python)(-a)?\n/

export const noteContent = new ExternalTokenizer((input) => {
    let current = input.peek(0);
    let next = input.peek(1);

    if (current === EOF) {
        return;
    }

    while (true) {
        // Unless the next two characters are a newline and a "∞", this can't be the
        // start of a note delimiter, so we don't need to check the rest of the token.
        if (current === FIRST_TOKEN_CHAR && next === SECOND_TOKEN_CHAR) {
            let potentialLang = "";
            // 18 characters is enough to cover the longest possible delimiter ("\n∞∞∞javascript-a\n")
            for (let i = 0; i < 18; i++) {
                potentialLang += String.fromCharCode(input.peek(i));
            }
            if (potentialLang.match(tokenRegEx)) {
                input.acceptToken(NoteContent);
                return;
            }
        }
        if (next === EOF) {
            input.acceptToken(NoteContent, 1);
            return;
        }
        current = input.advance(1);
        next = input.peek(1);
    }
});
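As an aside, the delimiter check the tokenizer performs can be illustrated in isolation. This is a standalone sketch (not part of the parser; the names are illustrative) that applies the same regular expression to a plain string:

```javascript
// Same pattern as in the tokenizer above: a newline, the "∞∞∞" mark,
// a language name, an optional "-a" flag, and a closing newline.
const delimiterRegEx = /^\n∞∞∞(text|javascript|json|python)(-a)?\n/

// Returns true when the string begins with a complete note delimiter.
function startsWithDelimiter(s) {
    return delimiterRegEx.test(s)
}

console.log(startsWithDelimiter("\n∞∞∞text-a\nsome content")) // true
console.log(startsWithDelimiter("\n∞∞∞ruby\nsome content"))   // false
```

The tokenizer only runs this check when the next two characters are a newline and "∞", so the regular expression is rarely evaluated.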

Yes, as of recent versions the local token group feature might help with this.

Thanks for the reply - that’s great to hear! I’ve now read and re-read the documentation on local token groups and done some trial and error to get the parser to do what I want. However, I’m afraid I’m in deep water here: I feel like I’ve only grasped a small part of Lezer (and syntax parsers in general), so I have not been very successful.

I’ll do some more experimenting tomorrow, but any pointers on how I could use local token groups for my use case would be extremely helpful.

The closest I’ve gotten (I think) is the following grammar:

@top Document { Note* }

Note {
  NoteDelimiter NoteContent
}

@local tokens {
    NoteDelimiter {
        noteDelimiterEnter noteDelimiterMark NoteLanguage Auto? noteDelimiterEnter
    }
    @else NoteContent
}


@tokens {
    noteDelimiterMark { "∞∞∞" }
    NoteLanguage { "text" | "javascript" | "json" | "python" }
    Auto { "-a" }
    noteDelimiterEnter { "\n" }
}

But for some reason, the above grammar results in a parser that doesn’t create nodes for NoteLanguage and Auto. The generated .terms.js file also doesn’t contain IDs for those nodes. Running the above grammar on the example content from my first post produces the following tree:

Document (0,151)
    Note (0,69)
        NoteDelimiter (0,11)
        NoteContent (11,69)
    Note (69,151)
        NoteDelimiter (69,78)
        NoteContent (78,151)

How can I make the parser also include NoteLanguage and Auto in the tree? Or am I completely misunderstanding the @local tokens feature? :)

Could it be a bug that’s causing some of the tokens (NoteLanguage and Auto) not to show up in the syntax tree, even though their names start with uppercase letters?

You aren’t using them in any grammar rules – you should be getting warnings about unused tokens when you compile this grammar.

I don’t get any warnings when I compile it:

> lezer-generator src/editor/lang-heynote/heynote.grammar -o src/editor/lang-heynote/parser.js

Wrote src/editor/lang-heynote/parser.js and src/editor/lang-heynote/parser.terms.js

If I change the grammar to this:

@top Document { Note* }

Note {
    NoteDelimiter NoteContent
}

NoteDelimiter {
    noteDelimiterEnter noteDelimiterMark NoteLanguage Auto? noteDelimiterEnter
}

@local tokens {
    NoteDelimiter {
        noteDelimiterEnter noteDelimiterMark NoteLanguage Auto? noteDelimiterEnter
    }
    @else NoteContent
}


@tokens {
    noteDelimiterMark { "∞∞∞" }
    NoteLanguage { "text" | "javascript" | "json" | "python" }
    Auto { "-a" }
    noteDelimiterEnter { "\n" }
}

I get an error message saying:

Duplicate definition of rule 'NoteDelimiter' (src/editor/lang-heynote/heynote.grammar 7:0)

Oh, right, I misread that. Tokens are atomic things that don’t nest. You can refer to other token names in them, but they will just be included as part of the outer token. It looks like you want things like Auto and NoteLanguage to be tokens in the @local tokens block, and NoteDelimiter to be a regular nonterminal rule.
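Roughly along these lines, perhaps (an untested sketch of that arrangement, reusing the token definitions from above):

```
Note { NoteDelimiter NoteContent }

NoteDelimiter {
    noteDelimiterEnter noteDelimiterMark NoteLanguage Auto? noteDelimiterEnter
}

@local tokens {
    noteDelimiterEnter { "\n" }
    noteDelimiterMark { "∞∞∞" }
    NoteLanguage { "text" | "javascript" | "json" | "python" }
    Auto { "-a" }
    @else NoteContent
}
```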

Hmm, if I put all those tokens directly under the @local tokens block, it fails to parse NoteContent, which is in the @else rule. I assume that’s because the tokenizer breaks on any of the local tokens, but I want everything to be parsed as NoteContent unless a whole NoteDelimiter is encountered.

I see. That might be difficult to do with @local tokens, since those assume a single set of valid tokens, whereas in your case NoteContent can appear in two contexts (before and after the Auto token). You could have NoteLanguage and NoteLanguageAuto tokens to kludge around this, but that’s not great either (they won’t be separate tokens). Possibly the original external tokenizer is the best way to do this after all.
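For illustration, that kludge might look something like this (a hypothetical, untested sketch; the combined NoteLanguageAuto token folds the "-a" flag into the language name, which is why Auto no longer exists as a separate node):

```
@local tokens {
    NoteLanguage { "text" | "javascript" | "json" | "python" }
    NoteLanguageAuto { ("text" | "javascript" | "json" | "python") "-a" }
    @else NoteContent
}
```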