StreamParser can produce excess markup

MusikAnimal · November 5, 2024, 9:48pm

Consider this demo, which parses external links. The rendered HTML is:

<div class="cm-line">
  <span class="ͼc">https://</span>
  <span class="ͼc">example.</span>
  <span class="ͼc">org</span>
</div>

What I would expect instead is this:

<div class="cm-line">
  <span class="ͼc">https://example.org</span>
</div>

That is, adjacent elements with the same CSS class would be combined into one element. Using very similar (but adapted) code, I do not see this issue with CodeMirror 5.

I’m having this problem with all rendered output in my CM6 parser (not just external links), so I don’t think it’s an issue with my regular expressions, etc. But there’s definitely a possibility I’m just doing something stupid.

So my question is, does this behaviour sound at all familiar? Is it expected in CM6? On very large documents, the difference in the HTML size can be quite significant, which is the issue.

marijn · November 6, 2024, 8:22am

Most parsers will generate a single token for a stretch of content with a single type, so I don’t think the difference in output complexity is all that big, in typical scenarios. But I guess merging tokens should be mostly harmless. Attached patch implements this.
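
For illustration only, here is a rough TypeScript sketch of the general idea of merging adjacent tokens that share a type. It is not the attached patch and does not use any CodeMirror API: the `Token` shape and the `mergeAdjacentTokens` name are hypothetical, made up for this example.

```typescript
// Hypothetical token shape for this sketch; not a CodeMirror type.
interface Token {
  from: number; // start offset in the line (inclusive)
  to: number;   // end offset in the line (exclusive)
  cls: string;  // style/class name assigned by the tokenizer
}

// Merge runs of tokens that touch each other and carry the same class.
// Tokens are assumed to be sorted by position and non-overlapping.
function mergeAdjacentTokens(tokens: readonly Token[]): Token[] {
  const merged: Token[] = [];
  for (const tok of tokens) {
    const last = merged[merged.length - 1];
    if (last && last.cls === tok.cls && last.to === tok.from) {
      // Same class and directly adjacent: extend the previous token.
      last.to = tok.to;
    } else {
      // Copy the token so the input array is left untouched.
      merged.push({ ...tok });
    }
  }
  return merged;
}

// The three pieces from the example above: "https://", "example.", "org".
const parts: Token[] = [
  { from: 0, to: 8, cls: "link" },
  { from: 8, to: 16, cls: "link" },
  { from: 16, to: 19, cls: "link" },
];
const mergedParts = mergeAdjacentTokens(parts);
console.log(mergedParts.length); // 1
```

Tokens with a gap between them, or with different classes, are left as separate entries, so the output stays faithful to what the tokenizer reported.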

MusikAnimal · November 7, 2024, 6:46pm

Yay! Thank you so much!