So, here, I’m mostly worried about performance, because I’ve noticed that custom tokenizers tend to only be used for the content of a comment.
In my case, I have 3 types of comments, currently expressed as:
ShortComment[group=BlockInline] {
"{{!" commentContent* "}}"
}
LongComment[group=BlockInline] {
"{{!--" commentContent* "--}}"
}
HTMLComment[group=BlockInline] {
"<!--" commentContent* "-->"
}
@external tokens commentContent from "./tokens.js" {
commentContent
}
Where I think I’m getting into trouble is that the }} token is also used for other syntax (runtime interpolation).
Here is the tokenizer I’ve written:
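import {ExternalTokenizer} from "@lezer/lr"
// Assumed import: the generated terms file name depends on the grammar file name.
import {commentContent as cmtToken} from "./parser.terms.js"
const curlyClose = 125 // "}"
const greaterThan = 62 // ">"
const dash = 45 // "-"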
const commentEnds = [
[dash, dash, greaterThan],
[dash, dash, curlyClose, curlyClose],
[curlyClose, curlyClose],
];
export const commentContent = new ExternalTokenizer(input => {
let matchAt = (i, char) => {
return commentEnds.filter(x => x[i] === char);
}
let fullMatch = (lengthMatched, matches) => {
return matches.length === 1 && matches[0].length === lengthMatched + 1;
}
let current = null;
let nextChar = input.next;
let i = 0;
let advance = () => {
input.advance();
nextChar = input.next;
}
while(!current) {
if (nextChar < 0) {
input.acceptToken(cmtToken)
break;
}
console.log('nextChar', nextChar, String.fromCharCode(nextChar));
let matches = matchAt(i, nextChar);
if (matches.length === 0) {
i = 0;
current = null;
advance();
continue;
};
if (fullMatch(i, matches)) {
current = matches[0];
console.log('matched on', String.fromCharCode(...current));
break;
}
i++;
advance();
}
if (current) {
input.acceptToken(cmtToken, current.length - 1);
}
});
At the moment it loops infinitely, and I’m not sure why. I think the infinite loop is a red herring, though, because when I step through the loop in the debugger, it is parsing things after the comment. Here is my input text:
<template>
{{!
simple comment
}}
{{#let greeting as |value|}}
{{value}}
{{/let}}
</template>;
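For comparison, here is a minimal sketch of a content scan that stops just before the terminator and never emits an empty token. It relies on two things about `acceptToken`: its second argument is an offset relative to the current stream position, so a positive value extends the token past where the scan stopped, and a zero-length token inside `commentContent*` is a plausible source of endless repetition. The helper names (`toCodes`, `atTerminator`, `scanCommentContent`, `fakeInput`) and the term id `1` are hypothetical; only `next`, `advance`, `peek`, and `acceptToken` are taken from the Lezer input stream.

```javascript
// Sketch only: `input` is anything exposing the InputStream members used
// below (next, advance, peek, acceptToken), as @lezer/lr provides them.
const toCodes = str => Array.from(str, ch => ch.charCodeAt(0));

// True when `terminator` (an array of char codes) starts at the current position.
function atTerminator(input, terminator) {
  for (let i = 0; i < terminator.length; i++) {
    if (input.peek(i) !== terminator[i]) return false;
  }
  return true;
}

// Consume characters up to, but not including, `terminator` or end of input.
// Emits `term` only when at least one character was consumed.
function scanCommentContent(input, term, terminator) {
  let consumed = 0;
  while (input.next >= 0 && !atTerminator(input, terminator)) {
    input.advance();
    consumed++;
  }
  // The token ends at the current position, which is the terminator's start.
  if (consumed > 0) input.acceptToken(term);
  return consumed;
}

// String-backed stand-in for InputStream, for trying the scan outside a parser.
function fakeInput(text) {
  return {
    pos: 0,
    token: null,
    get next() { return this.peek(0); },
    peek(offset) {
      const at = this.pos + offset;
      return at >= 0 && at < text.length ? text.charCodeAt(at) : -1;
    },
    advance() { this.pos++; return this.next; },
    acceptToken(term, endOffset = 0) {
      this.token = { term, end: this.pos + endOffset };
    },
  };
}

const SHORT_END = toCodes("}}");
const input = fakeInput("\n  simple comment\n}}\n{{value}}");
scanCommentContent(input, 1 /* hypothetical term id */, SHORT_END);
console.log(input.token); // token ends where the closing "}}" begins
```

Inside a real grammar the same function could be wrapped as `new ExternalTokenizer(input => scanCommentContent(input, cmtToken, SHORT_END))`, leaving the closing delimiter to the literal token in the grammar rule.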
So, questions:
- Does it make sense, or is it even possible, to have a custom tokenizer include the start/end of a comment? How would nesting work? All the other comment forms are ignored inside a comment (for example, an HTML comment is treated as nothing other than comment content inside the other two).
- Is a tokenizer that looks for the starts of patterns performant, or should it be avoided?
- Could the tokenizer I have now know which kind of comment start it is in, or would I want 3 separate tokenizers? (I’m going to try this next.)
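On the last question, one option that avoids three separate tokenizers is to declare one content token per comment form and let a single tokenizer ask the parser which of them is valid in the current state. The sketch below assumes the grammar's `@external tokens` block were changed to declare three tokens named `shortCommentContent`, `longCommentContent`, and `htmlCommentContent` (hypothetical names, with made-up term ids here). It relies on `stack.canShift(term)`, which the external tokenizer callback can call on its second argument; the stack object in the example is a stand-in, not a real Lezer stack.

```javascript
// Hypothetical term ids; in a real project they come from the generated terms file.
const shortCommentContent = 1, longCommentContent = 2, htmlCommentContent = 3;

const toCodes = str => Array.from(str, ch => ch.charCodeAt(0));

// Each content token is paired with the terminator that ends its comment form.
const terminators = [
  [longCommentContent, toCodes("--}}")],
  [shortCommentContent, toCodes("}}")],
  [htmlCommentContent, toCodes("-->")],
];

// Return the [term, terminator] pair for the first content token the parser
// can shift in its current state, or null when no comment is open.
function pickTerminator(stack) {
  for (const entry of terminators) {
    if (stack.canShift(entry[0])) return entry;
  }
  return null;
}

// Stand-in for a parse stack that is currently inside a short comment.
const insideShort = { canShift: term => term === shortCommentContent };
const picked = pickTerminator(insideShort);
console.log(picked[0], String.fromCharCode(...picked[1]));
```

Because only the terminator of the enclosing form is ever searched for, the other comment forms fall through as plain content, which matches the "ignored within the others" behavior described above. When `pickTerminator` returns null, the tokenizer can return without scanning at all.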