What are strategies for having an ExternalTokenizer recognize the entirety of a comment?

So, here, I’m mostly worried about performance, because I’ve noticed that custom tokenizers tend to be used only for the content of a comment, not its delimiters.

In my case, I have 3 types of comments, currently expressed as:

ShortComment[group=BlockInline] {
  "{{!" commentContent* "}}"
}

LongComment[group=BlockInline] {
  "{{!--" commentContent* "--}}"
}

HTMLComment[group=BlockInline] {
  "<!--" commentContent* "-->"
}

@external tokens commentContent from "./tokens.js" {
  commentContent
}

Where I think I’m getting into trouble is that these }} tokens are also used for other syntax (runtime interpolation).

Here is the tokenizer I’ve written:

import { ExternalTokenizer } from "@lezer/lr";
import { commentContent as cmtToken } from "./syntax.grammar.terms";

const curlyClose = 125
const greaterThan = 62
const dash = 45

const commentEnds = [
  [dash, dash, greaterThan],  
  [dash, dash, curlyClose, curlyClose],
  [curlyClose, curlyClose],
];

export const commentContent = new ExternalTokenizer(input => {
  let matchAt = (i, char) => {
    return commentEnds.filter(x => x[i] === char);
  }
  let fullMatch = (lengthMatched, matches) => {
    return matches.length === 1 && matches[0].length === lengthMatched + 1;
  }

  let current = null;
  let nextChar = input.next;
  let i = 0;
  let advance = () => {
    input.advance();
    nextChar = input.next;
  }

  while(!current) {
    if (nextChar < 0) {
      input.acceptToken(cmtToken) 
      break;
    }

    console.log('nextChar', nextChar, String.fromCharCode(nextChar));
    let matches = matchAt(i, nextChar);

    if (matches.length === 0) {
      i = 0;
      current = null;
      advance();
      continue;
    }

    if (fullMatch(i, matches)) {
      current = matches[0];
      console.log('matched on', String.fromCharCode(...current));
      break;
    }


    i++;
    advance();
  }

  if (current) {
    input.acceptToken(cmtToken, current.length - 1);
  }
});

At the moment it infinitely loops, and I’m not sure why – but I think the infinite looping is a red herring, because as I step through the loop in the debugger, it’s parsing things after the comment. Here is my input text:

<template>
  {{! 
    simple comment 
  }}
  
  {{#let greeting as |value|}}
    {{value}}
  {{/let}}
</template>;

So, question:

  • does it make sense / is it possible to have a custom tokenizer include the start/end of a comment? How would nesting work? (All the other comment forms are ignored within the others; an HTML comment isn’t treated as anything other than comment content within the other two, for example.)
  • is a tokenizer that looks for the starts of patterns performant? should it be avoided?
  • could a tokenizer that I have now know what sort of start situation I have? or would I want 3 separate tokenizers? (I’m going to try this next)

Obviously it’s possible. Nesting would just involve a counter in the tokenizer function.
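As a concrete (hypothetical) illustration of the counter idea, outside of Lezer: scanNestedLongComment below is an invented helper that walks a plain string and treats nested {{!-- … --}} pairs as comment content, only ending the comment when the depth returns to zero.

```javascript
// Invented sketch of a depth counter for nestable {{!-- ... --}} comments.
// Given the source and the offset just past an opening "{{!--", return the
// offset where the comment's content ends (or source.length if unterminated).
function scanNestedLongComment(source, start) {
  let depth = 1; // one comment is already open
  let pos = start;
  while (pos < source.length) {
    if (source.startsWith("{{!--", pos)) {
      depth++; // nested opener: one more "--}}" is now needed
      pos += 5;
    } else if (source.startsWith("--}}", pos)) {
      depth--;
      if (depth === 0) return pos; // content ends before this closer
      pos += 4;
    } else {
      pos++;
    }
  }
  return source.length; // unterminated: treat everything as content
}

// scanNestedLongComment("{{!-- a {{!-- b --}} c --}} d", 5) → 23,
// i.e. the content is " a {{!-- b --}} c " and the final "--}}" closes it
```

In an actual ExternalTokenizer the same counter would be driven by input.next / input.advance() instead of string indexing.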

Not sure what this means.

Yes, you’ll want 3 different tokenizers.


thank you!

I’ve already started seeing benefits with this strategy:

import { ExternalTokenizer } from '@lezer/lr';
import { 
  htmlCommentContent as htmlCommentToken, 
  longCommentContent as longCommentToken, 
  shortCommentContent as shortCommentToken 
} from './syntax.grammar.terms';

const curlyClose = 125;  // "}"
const greaterThan = 62;  // ">"
const dash = 45;         // "-"

const shortCommentEnd = [curlyClose, curlyClose];
const longCommentEnd = [dash, dash, curlyClose, curlyClose];
const htmlCommentEnd = [dash, dash, greaterThan];


export const shortCommentContent = new ExternalTokenizer((input) => {
  return matchForComment(shortCommentEnd, shortCommentToken, input);
});

export const longCommentContent = new ExternalTokenizer((input) => {
  return matchForComment(longCommentEnd, longCommentToken, input);
});

export const htmlCommentContent = new ExternalTokenizer((input) => {
  return matchForComment(htmlCommentEnd, htmlCommentToken, input);
});

// ---


function matchForComment(commentEndPattern, commentToken, input) {
  let nextChar = input.next;
  let i = 0;
  let advance = () => {
    input.advance();
    nextChar = input.next;
  };

  while (true) {
    console.log({ nextChar, i, commentEndPattern, commentToken });
 
    // exit condition
    if (nextChar < 0) {
      input.acceptToken(commentToken);

      break;
    }

    let hasMatch = commentEndPattern[i] === nextChar;

    // mismatch: reset the pattern index and keep scanning
    if (!hasMatch) {
      i = 0;
      advance();
      continue;
    }

    // full match of the end-comment pattern: accept the content,
    // ending the token just before the closing delimiter
    if (i === commentEndPattern.length - 1) {
      input.acceptToken(commentToken, 1 - commentEndPattern.length);

      break;
    }

    // if we haven't continue'd or break'd, we have a partial match:
    // check the next character against the pattern as well
    i++;
    advance();
  }
}

And the grammar now uses a distinct content token per comment form:

ShortComment[group=BlockInline] {
  "{{!" shortCommentContent* "}}"
}

LongComment[group=BlockInline] {
  "{{!--" longCommentContent* "--}}"
}

HTMLComment[group=BlockInline] {
  "<!--" htmlCommentContent* "-->"
}

@external tokens shortCommentContent from "./tokens.js" {
  shortCommentContent
}
@external tokens longCommentContent from "./tokens.js" {
  longCommentContent
}
@external tokens htmlCommentContent from "./tokens.js" {
  htmlCommentContent
}

I have some infinite looping to work out still, but it’s coming along.
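One way to chase the remaining looping is to drive the scanner offline with a tiny mock of the InputStream surface it touches (next, advance, acceptToken). Everything here is an illustrative sketch (mockInput is an invented stand-in, not Lezer's real class), but it makes the offset arithmetic easy to assert on:

```javascript
const curlyClose = 125;
const shortCommentEnd = [curlyClose, curlyClose];

// Invented stand-in for the parts of Lezer's InputStream the scanner uses.
function mockInput(text) {
  let pos = 0;
  return {
    accepted: null, // set by acceptToken: { token, end }
    get next() { return pos < text.length ? text.charCodeAt(pos) : -1; },
    advance() { pos++; },
    acceptToken(token, endOffset = 0) {
      this.accepted = { token, end: pos + endOffset };
    },
  };
}

// The same scanning logic as matchForComment above
// (minus the logging and the local nextChar bookkeeping).
function matchForComment(commentEndPattern, commentToken, input) {
  let i = 0;
  while (true) {
    if (input.next < 0) {          // EOF: everything so far is content
      input.acceptToken(commentToken);
      break;
    }
    if (commentEndPattern[i] !== input.next) {
      i = 0;                       // mismatch: reset and keep scanning
      input.advance();
      continue;
    }
    if (i === commentEndPattern.length - 1) {
      // full end pattern: end the token before the closing delimiter
      input.acceptToken(commentToken, 1 - commentEndPattern.length);
      break;
    }
    i++;                           // partial match: look at the next char
    input.advance();
  }
}

const a = mockInput("simple comment }}");
matchForComment(shortCommentEnd, "short", a);
// a.accepted.end === 15: the token stops right before "}}"

const b = mockInput("}}");
matchForComment(shortCommentEnd, "short", b);
// b.accepted.end === 0: an empty comment yields a zero-width token, which,
// combined with shortCommentContent* in the grammar, is one plausible
// source of an infinite loop worth checking
```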