Markdown and Latex syntax highlighting

bzrr · May 14, 2022, 9:35pm

Hi,

I’m wondering what the recommended way is to get both Markdown and LaTeX syntax highlighting working simultaneously. I noticed there’s a function called parseMixed in the @lezer/common package that selects a different parser for particular nodes. I would like text surrounded by single (inline math mode) or double (display math mode) dollar signs to be highlighted as LaTeX and everything outside as Markdown. Should I read the ranges from each node given by parseMixed and find matches with a regex, or should I extend @lezer/markdown to add new node types for these sections? Also, would I have to modify the Markdown table extension to allow for LaTeX within tables?

marijn · May 14, 2022, 9:52pm

Yes, write an extension for the markdown parser that recognizes dollar sign markup, and either directly integrate the math parsing in there, or use parseMixed to enable some kind of LaTeX parsing inside those nodes.

bzrr · May 15, 2022, 2:23am

I copied the code for the Strikethrough extension and made it work with dollar signs. Don’t know if I’m doing this correctly, but the mixed parsing wrapper became simply this:

const latexWrapper = parseMixed((node, input) => {
    if (node.type.name === "InlineMath") {
        return { parser };
    }
    return null;
});

The problem I’m running into now is that I would like a <div class="cm-math"> to wrap the LaTeX elements, but I guess the parser replaces the “InlineMath” node with its own nodes. What’s the correct way of doing this?

bzrr · May 15, 2022, 3:34am

Never mind. Just had to do:

const latexWrapper = parseMixed((node, input) => {
    if (node.type.name === "InlineMath") {
        return { parser, overlay: [{ from: node.from, to: node.to }] };
    }

    return null;
});
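
Passing overlay ranges makes parseMixed mount the nested tree as an overlay, so the host InlineMath node stays in the outer tree. As a rough sketch, assuming the node includes its delimiters, the range could also be narrowed so the dollar signs themselves are not handed to the LaTeX parser. `innerMathRange` and `OffsetRange` are hypothetical names for illustration, not part of @lezer/common.

```typescript
interface OffsetRange { from: number; to: number }

// Hypothetical helper (not a Lezer API): shrink a math node's range by the
// width of its delimiters so only the formula text is parsed as LaTeX.
function innerMathRange(node: OffsetRange, delimLen: number): OffsetRange | null {
  const from = node.from + delimLen;
  const to = node.to - delimLen;
  // Nothing between the delimiters (for example "$$"): no range to hand over.
  return from < to ? { from, to } : null;
}
```

Inside the parseMixed callback, the result could be used as `overlay: [inner]` when `inner` is not null, with the callback returning null otherwise.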

bzrr · May 15, 2022, 8:25am

@marijn I was able to do the inline parser, but I’m having trouble parsing the latex blocks. This is what I got:

parse: (cx: BlockContext, line: Line) => {
    if (!line.text.startsWith("$$")) {
        return false;
    }

    const startFrom = line.pos;
    const startTo   = line.pos + 2;

    while (cx.nextLine()) {
        if (line.text.startsWith("$$")) {
            const mark = cx.elt(mathBlockMark, cx.lineStart, cx.lineStart + 2);
            const elt  = cx.elt(mathBlockNode, startFrom, startTo, [mark]);
            cx.addElement(elt);
            return true;
        }
    }

    return false;
}

Am I doing this correctly?

marijn · May 16, 2022, 7:20am

I’m not sure how this block math markup works, but that code looks a bit dubious: you’re returning false when there’s only a single line prefixed with $$, and creating separate mathBlockNode elements for every prefixed line beyond the first (but not for the first).
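
To make the intended behaviour concrete, here is a standalone sketch of the line scanning only, under the assumption that a block opens on a line starting with $$, may close on the same line, and should produce exactly one range from the opening to the closing delimiter. `findMathBlock` and `MathBlock` are hypothetical names; in an actual block parser the line text and offsets would come from `Line.text` and `BlockContext.lineStart` rather than an array of strings.

```typescript
interface MathBlock { from: number; to: number; endLine: number }

// Hypothetical standalone helper: offsets assume "\n" line separators.
function findMathBlock(lines: string[], startLine: number): MathBlock | null {
  if (startLine < 0 || startLine >= lines.length) return null;
  const first = lines[startLine];
  if (!first.startsWith("$$")) return null;

  // Document offset of the start of `startLine`.
  let lineStart = 0;
  for (let i = 0; i < startLine; i++) lineStart += lines[i].length + 1;
  const from = lineStart;

  // Single-line case: "$$2x+1$$" opens and closes on the same line.
  const sameLineClose = first.indexOf("$$", 2);
  if (sameLineClose >= 0) {
    return { from, to: lineStart + sameLineClose + 2, endLine: startLine };
  }

  // Multi-line case: one range from the opening "$$" to the closing "$$".
  for (let i = startLine + 1; i < lines.length; i++) {
    lineStart += lines[i - 1].length + 1;
    const close = lines[i].indexOf("$$");
    if (close >= 0) return { from, to: lineStart + close + 2, endLine: i };
  }
  return null; // unterminated: treated here as "not a math block"
}
```

Whether an unterminated $$ line should still count as a block is a design choice; this sketch assumes it should not.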

bzrr · May 16, 2022, 8:58am

Yeah, that was a bit off, but I think I figured out how to do it.

  parse: (cx: BlockContext, line: Line) => {
      if (!line.text.startsWith("$$")) {
          return false;
      }

      const start = cx.lineStart;
      while (cx.nextLine()) {
          if (line.text.startsWith("$$")) {
              cx.addElement(cx.elt(mathBlockNode, start, cx.lineStart + 2));
              cx.nextLine();
              return true;
          }
      }
      return false;
  }

bzrr · May 21, 2022, 8:03am

So, that ended up not working lol. What I wanna do is parse everything between the dollar signs (including the dollar signs themselves) as a MathBlock, regardless of where they are (I think this is how FencedCode is parsed). I guess doing this with parseBlock.parse won’t work since the start of a block doesn’t have to be at the beginning of a line. Anyway, the following examples should all be valid:

Example 1

$$2x+1$$

Example 2

$$
2x+1$$

Example 3

test $$
2x+1
$$

Example 4

test $$
2x+1
$$

@marijn How do I go about doing this?

marijn · May 24, 2022, 11:29am

If these can occur in inline text then it looks like you’ll have to define an inline parser for them.
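
As a sketch of what such an inline scan has to do, the function below finds $$...$$ and $...$ spans anywhere in a piece of text, including across line breaks, and skips backslash-escaped characters. It is plain TypeScript with hypothetical names (`findMathSpans`, `findClose`, `MathSpan`) and does not use the @lezer/markdown API; an inline parser would presumably run similar logic from its parse function and report the span with `cx.elt` and `cx.addElement`. Note that inline parsers only see the content of one leaf block, so a blank line between the delimiters would split the paragraph and the span would not be found.

```typescript
type MathKind = "InlineMath" | "BlockMath";
interface MathSpan { kind: MathKind; from: number; to: number }

// Find the next unescaped occurrence of `delim` at or after `start`.
function findClose(text: string, delim: string, start: number): number {
  let i = start;
  while (i < text.length) {
    if (text[i] === "\\") { i += 2; continue; } // skip escapes such as \$
    if (text.startsWith(delim, i)) return i;
    i++;
  }
  return -1;
}

// Hypothetical scanner: spans include the dollar-sign delimiters.
function findMathSpans(text: string): MathSpan[] {
  const spans: MathSpan[] = [];
  let pos = 0;
  while (pos < text.length) {
    if (text[pos] === "\\") { pos += 2; continue; }
    if (text[pos] !== "$") { pos++; continue; }

    const delim = text.startsWith("$$", pos) ? "$$" : "$";
    const close = findClose(text, delim, pos + delim.length);
    if (close < 0) { pos += delim.length; continue; } // unmatched: plain text

    const to = close + delim.length;
    spans.push({ kind: delim === "$$" ? "BlockMath" : "InlineMath", from: pos, to });
    pos = to;
  }
  return spans;
}
```

For instance, `findMathSpans("test $$\n2x+1\n$$")` yields a single BlockMath span that starts at the opening $$ and ends after the closing one.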

bxff · May 24, 2022, 12:11pm

Hello there, I am also trying to do the same thing. I was wondering what parser you are using. Are you using sTeX from the legacy parser?

I also managed to create a simple inline parser based off the Strikethrough extension, but I haven’t used the block parser, so unfortunately I may not be able to help you.

personalizedrefriger · June 20, 2022, 2:59am

This is similar to what was included above, but this is the parser I’m currently using to mark regions as InlineMath or BlockMath.

The variant of markdown the parser is for, however, only supports $$ at the beginning of a line and uses $ for inline math.

I haven’t tried using it yet, but it looks like a 3rd-party lezer parser for TeX exists (lezer-tex on npm).

personalizedrefriger · June 24, 2022, 7:35am

A version that uses the sTeX parser can be found here.

Edit: This post originally contained a question about the usage of what I thought was an @internal constructor. I was confused. The original question is below:


This version of the parser does, however, use an @internal version of cx.elt(...):

github.com

lezer-parser/markdown/blob/main/src/markdown.ts#L901

      
        
  finishLeaf(leaf: LeafBlock) {
    for (let parser of leaf.parsers) if (parser.finish(this, leaf)) return
    let inline = injectMarks(this.parser.parseInline(leaf.content, leaf.start), leaf.marks)
    this.addNode(this.buffer
      .writeElements(inline, -leaf.start)
      .finish(Type.Paragraph, leaf.content.length), leaf.start)
  }

  /// Create an [`Element`](#Element) object to represent some syntax
  /// node.
  elt(type: string, from: number, to: number, children?: readonly Element[]): Element
  elt(tree: Tree, at: number): Element
  elt(type: string | Tree, from: number, to?: number, children?: readonly Element[]): Element {
    if (typeof type == "string") return elt(this.parser.getNodeType(type), from, to!, children)
    return new TreeElement(type, from)
  }

  /// @internal
  get buffer() { return new Buffer(this.parser.nodeSet) }
}

github.com

laurent22/joplin/blob/689931605722ca6d57cafe06b15089613e641b8b/packages/app-mobile/components/NoteEditor/MarkdownTeXParser.ts#L147

      
        
            // Remove the ending delimiter
            stop = cx.lineStart + lineLength - endMatch[0].length;
        } else {
            stop = cx.lineStart;
        }
    }

    // Label the region. Add two labels so that one can be removed.
    const contentElem = cx.elt(BLOCK_MATH_CONTENT_TAG, start, stop);
    cx.addElement(
        cx.elt(BLOCK_MATH_TAG, start - delimLen, stop + delimLen, [contentElem])
    );

    // Don't re-process the ending delimiter (it may look the same
    // as the starting delimiter).
    cx.nextLine();

    return true;
}

return false;

I’m using this constructor to nest elements. Is there some other way I should be doing this?

marijn · June 24, 2022, 11:15am

BlockContext.elt is public. Could you elaborate on what is internal about this use?

personalizedrefriger · June 24, 2022, 11:31am

Sorry! I was confused!

I was looking at the Element constructor:

github.com

lezer-parser/markdown/blob/f94089de559314239630f3b7bd53af4e76f3786a/src/markdown.ts#L1315

      
        
export class Element {
  /// @internal
  constructor(
    /// The node's
    /// [id](https://lezer.codemirror.net/docs/ref/#common.NodeType.id).
    readonly type: number,
    /// The start of the node, as an offset from the start of the document.
    readonly from: number,
    /// The end of the node.
    readonly to: number,
    /// The node's child nodes @internal
    readonly children: readonly (Element | TreeElement)[] = none
  ) {}

  /// @internal
  writeTo(buf: Buffer, offset: number) {
    let startOff = buf.content.length
    buf.writeElements(this.children, offset)
    buf.content.push(this.type, this.from + offset, this.to + offset, buf.content.length + 4 - startOff)
  }

(which I am not using).

fcollonval · May 9, 2023, 8:43am

Thanks a lot for sharing all this information. It allows us to enable mathematical expression highlighting in JupyterLab 4 (which switches to CodeMirror 6). Interested developers can have a look at:

github.com/jupyterlab/jupyterlab

Add math expression parser for markdown

jupyterlab:master ← fcollonval:ft/latex-md-parser

opened 04:56PM - 08 May 23 UTC

fcollonval

+478 -114

## References
Fixes #14155

## Code changes
Add custom Markdown extension to parse LaTeX mathematical expressions.

## User-facing changes
Mathematical expressions are highlighted in CodeMirror Markdown editors.

Before
![image](https://user-images.githubusercontent.com/8435071/237041897-d44a1912-698d-4bbf-a0d9-b31ff85fa913.png)

After
![image](https://user-images.githubusercontent.com/8435071/236884408-61e014a1-2d19-4013-b9d1-9e845c8c1a11.png)

## Backwards-incompatible changes
None