Parsing HTML to markdown markup

makhnatkin · November 26, 2024, 3:34pm

Proposal

In our project, we need to copy HTML and paste it into CodeMirror as Markdown markup. Is there a parser that supports this? Could Lezer be used for this task?

I also noticed that ProseMirror provides a DOMParser, but it only parses content into ProseMirror Node objects.

To address this, I propose extracting the functionality of DOMParser from ProseMirror into a standalone npm package, such as dom-parser. This package could be used across various projects, including ProseMirror, CodeMirror, and beyond. Decoupling DOMParser would simplify integration into projects that do not need the full ProseMirror stack.

Goal

The primary goal is to make DOMParser more versatile, enabling it to convert HTML into a broader range of formats, such as markdown markup (for example <b>text</b> -> **text**). This would significantly enhance its utility for projects working with structured content.
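To make the `<b>text</b> -> **text**` idea concrete, here is a minimal sketch of a rule-based conversion. It is not ProseMirror code: `HtmlNode`, `inlineRules`, and `toMarkdown` are hypothetical names, and the tree type is a small stand-in for real DOM nodes so the snippet runs without a browser. It handles only a few inline tags and does no escaping.

```typescript
// Minimal stand-in for a DOM tree (assumption, so the sketch runs outside a browser).
// With a real DOM you would walk `node.childNodes` instead.
type HtmlNode =
  | { kind: "text"; text: string }
  | { kind: "element"; tag: string; attrs?: Record<string, string>; children: HtmlNode[] };

// A rule receives the already-converted children and the element's attributes.
type InlineRule = (inner: string, attrs: Record<string, string>) => string;

// Hypothetical rule table: tag name -> Markdown wrapper.
const inlineRules = new Map<string, InlineRule>([
  ["b", (inner) => `**${inner}**`],
  ["strong", (inner) => `**${inner}**`],
  ["i", (inner) => `*${inner}*`],
  ["em", (inner) => `*${inner}*`],
  ["code", (inner) => `\`${inner}\``],
  ["a", (inner, attrs) => `[${inner}](${attrs["href"] ?? ""})`],
]);

function toMarkdown(node: HtmlNode): string {
  if (node.kind === "text") return node.text;
  // Convert children first, then wrap them according to the matching rule.
  const inner = node.children.map(toMarkdown).join("");
  const rule = inlineRules.get(node.tag);
  // Tags without a rule keep only their converted content.
  return rule ? rule(inner, node.attrs ?? {}) : inner;
}

// Equivalent to: <p>Some <b>text</b> and <a href="https://example.com">a link</a></p>
const sample: HtmlNode = {
  kind: "element",
  tag: "p",
  children: [
    { kind: "text", text: "Some " },
    { kind: "element", tag: "b", children: [{ kind: "text", text: "text" }] },
    { kind: "text", text: " and " },
    {
      kind: "element",
      tag: "a",
      attrs: { href: "https://example.com" },
      children: [{ kind: "text", text: "a link" }],
    },
  ],
};

const markdownSample = toMarkdown(sample);
console.log(markdownSample); // Some **text** and [a link](https://example.com)
```

A real implementation would also need block-level rules (paragraphs, lists, headings) and escaping of Markdown special characters, which this sketch leaves out.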

Example Implementation

Below is an example implementation that demonstrates parsing HTML into either ProseMirror Node objects or markup:

export class DOMParser {
  // other methods...

  /// Parse a document from the content of a DOM node.
  parse(dom: DOMNode, options: ParseOptions = {}, outputFormat: "node" | "markup" = "node"): Node | string {
    let context = new ParseContext(this, options, false, outputFormat)
    context.addAll(dom, Mark.none, options.from, options.to)
    return context.finish() 
  }
}

class ParseContext {
  private markupBuffer: string = ""
  private outputFormat: "node" | "markup"

  constructor(
    readonly parser: DOMParser,
    readonly options: ParseOptions,
    readonly isOpen: boolean,
    outputFormat: "node" | "markup"
  ) {
    this.outputFormat = outputFormat
  }

  // other methods...
}

// examples
const domParser = new DOMParser(schema, rules)

// Markdown markup
const markdown = domParser.parse(domNode, {}, "markup")
console.log(markdown)

// ProseMirror node
const docNode = domParser.parse(domNode, {})
console.log(docNode)

I wanted to roughly indicate the direction in which the parser could develop. That said, the example above is still tied to the ProseMirror API (Node, Mark.none, and other details).

The output type may need to be passed through to the various parsing functions, and the class could become generic:

export class DOMParser<TOutput extends any = Node> {
  // Constructor accepts a schema and custom rules for parsing.
  constructor(
    readonly rules: ParseRule[] // Parsing rules for DOM
  ) {}

  /// Parse a document from the content of a DOM node.
  parse(dom: DOMNode, options: ParseOptions = {}, /*...maybe something else */): TOutput | string {

  }
}

I don't have more precise implementation ideas yet; this example is only meant to illustrate the proposal in more detail.
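One possible shape for the generic version, purely as a sketch: the parser only walks the tree, and a builder object decides what gets produced. Everything here is an assumption for illustration (`SourceNode`, `OutputBuilder`, `GenericDOMParser`, and `JsonNode` do not exist in ProseMirror), and the JSON builder merely stands in for building ProseMirror Node objects.

```typescript
// Minimal stand-in for DOM nodes (assumption, so the sketch runs outside a browser).
type SourceNode =
  | { kind: "text"; text: string }
  | { kind: "element"; tag: string; children: SourceNode[] };

// Hypothetical interface: one implementation per output format.
interface OutputBuilder<TOutput> {
  text(value: string): TOutput;
  element(tag: string, children: TOutput[]): TOutput;
}

// The parser is generic over the output and knows nothing about any format.
class GenericDOMParser<TOutput> {
  readonly builder: OutputBuilder<TOutput>;

  constructor(builder: OutputBuilder<TOutput>) {
    this.builder = builder;
  }

  parse(node: SourceNode): TOutput {
    if (node.kind === "text") return this.builder.text(node.text);
    const children = node.children.map((child) => this.parse(child));
    return this.builder.element(node.tag, children);
  }
}

// Builder 1: produces a Markdown string.
const markdownBuilder: OutputBuilder<string> = {
  text: (value) => value,
  element: (tag, children) => {
    const inner = children.join("");
    if (tag === "b" || tag === "strong") return `**${inner}**`;
    if (tag === "i" || tag === "em") return `*${inner}*`;
    return inner; // unknown tags keep only their content
  },
};

// Builder 2: produces a plain JSON tree (stand-in for a document node tree).
interface JsonNode {
  type: string;
  text?: string;
  content?: JsonNode[];
}

const jsonBuilder: OutputBuilder<JsonNode> = {
  text: (value) => ({ type: "text", text: value }),
  element: (tag, children) => ({ type: tag, content: children }),
};

// Equivalent to: <p>Hello <b>world</b></p>
const input: SourceNode = {
  kind: "element",
  tag: "p",
  children: [
    { kind: "text", text: "Hello " },
    { kind: "element", tag: "b", children: [{ kind: "text", text: "world" }] },
  ],
};

const asMarkdown = new GenericDOMParser(markdownBuilder).parse(input);
const asJson = new GenericDOMParser(jsonBuilder).parse(input);

console.log(asMarkdown); // Hello **world**
console.log(JSON.stringify(asJson));
```

With this split, the return type is just `TOutput`, so no extra `outputFormat` flag or `TOutput | string` union is needed. Whether ProseMirror's actual parsing rules and context handling can be separated this cleanly is an open question.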

What is your opinion on this matter?