How to skip positions of certain inline nodes in Markdown parser extension.

bxff · May 18, 2022, 3:32pm

I have a use case where I would like to skip ranges of certain parsed nodes when creating a custom inline parser, using the extension architecture of the Markdown lezer parser.

So here's what I am trying to do: I want to parse links that are not written in the Markdown format (not [LinkName](Linkitself "Sometitle"), but just Linkitself), i.e. in plain link format. I have a regex parser that matches such links, but it cannot distinguish links that were already parsed in the proper format, and it finds every link whether it is inside a Link node or a Paragraph node. I need a way to distinguish links that are already parsed.

In markdown.ts, where there is a similar need to find earlier nodes, cx.parts is used, but it is not available on the InlineContext exposed by the extension architecture.

Also, in the code below, returning cx.end from the parser would make it skip every other parser. How can I tell the parser that I have parsed the line, without making it skip the other parsers?

Here's some code I am trying to figure out:

export const HTTPLink: MarkdownConfig = {
	defineNodes: ["HTTPLink"],
	parseInline: [{
		name: "HTTPLink",
		parse(cx, next, pos) {
			let match, indexes = [];
			while (match = regexp.exec(cx.text))
				indexes.push([match.index, match.index + match[0].length]);
			if (indexes) {
				indexes.forEach((v, i, a) => {
					cx.addElement(cx.elt('HTTPLink', v[0], v[1]))
				})
			}
			return -1 // actually not sure for the moment
		},
		after: "LinkEnd"
	}]
}

Thanks.

marijn · May 18, 2022, 3:53pm

Inline content is parsed in order, so your parse should only parse links directly at pos, not all over the current block, and return the end of the link when it finds one.
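
A minimal sketch of what that could look like, assuming a deliberately loose URL pattern (not a full URL grammar). `InlineContextLike` and `urlLengthAtStart` are hypothetical names introduced only for this example; the interface is a local stand-in for the `slice`, `end`, `elt` and `addElement` members of `InlineContext`, so the snippet type-checks without the library installed.

```typescript
// Local stand-in for the few InlineContext members used below, so the sketch
// type-checks on its own. In a real extension, import InlineContext and
// MarkdownConfig from "@lezer/markdown" instead.
interface InlineContextLike {
  readonly end: number
  slice(from: number, to: number): string
  elt(type: string, from: number, to: number): unknown
  addElement(elt: unknown): number
}

// Anchored with ^ so it can only match at the start of the string it is given.
// The character class is an assumption made for this sketch.
const urlAtStart = /^https?:\/\/[^\s<>]+/

// Length of a URL that begins at the very start of `text`, or 0 if there is none.
function urlLengthAtStart(text: string): number {
  const m = urlAtStart.exec(text)
  return m ? m[0].length : 0
}

const HTTPLink = {
  defineNodes: ["HTTPLink"],
  parseInline: [{
    name: "HTTPLink",
    parse(cx: InlineContextLike, next: number, pos: number): number {
      // Cheap rejection: this sketch only accepts URLs starting with "h" (char code 104).
      if (next != 104) return -1
      // `pos` is a document position, so read the text through cx.slice
      // rather than indexing into the block text directly.
      const len = urlLengthAtStart(cx.slice(pos, cx.end))
      if (len == 0) return -1
      // addElement returns the end of the element, which tells the inline
      // parser where to continue.
      return cx.addElement(cx.elt("HTTPLink", pos, pos + len))
    }
  }]
}
```

The parser only ever looks at the text starting at `pos`, and it either declines with -1 or returns the end of the link it added, so the positions in between are not offered to other inline parsers.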

bxff · May 19, 2022, 1:14pm

I managed to get it working as you said, by checking whether pos is the start of a match, but I am still getting the problem of links being confused with italics. Take the example of https://google.com/_italic_: here, italic is also being parsed. I have also tried putting before: "Emphasis".

What should I do so that I can tell the editor not to parse italics inside it?

Here's the code I am working with:

// HTTP Links: https://google.com
let regexp = new RegExp(/https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi) // https://stackoverflow.com/a/3809435
export const HTTPLink: MarkdownConfig = {
	defineNodes: ["HTTPLink"],
	parseInline: [{
		name: "HTTPLink",
		parse(cx, next, pos) {
			let match, indexes = [];
			while (match = regexp.exec(cx.text))
				indexes.push([match.index, match.index + match[0].length]);
			if (indexes) {
				indexes.forEach((v, i, a) => {
					if (v[0] == pos) {
						return cx.addElement(cx.elt('HTTPLink', v[0], v[1]))
					}
				})
			}
			return -1
		},
		before: "Emphasis"
	}]
}

marijn · May 19, 2022, 7:43pm

Make sure your regexp doesn’t consume markup you want to leave alone? Though generally, if people type underscores in the context of a URL, those do belong to the URL.
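
One way to read this suggestion, as a sketch only: exclude the emphasis delimiters from the URL character class, so the match stops before them. The pattern and the name `urlEnd` are assumptions made for this example, and the trade-off is the one mentioned above: URLs that legitimately contain underscores get cut short.

```typescript
// Hypothetical pattern: stops at whitespace, angle brackets, and the
// emphasis delimiters "_" and "*".
const urlBody = /^https?:\/\/[^\s<>_*]+/

// Returns the end offset of a URL starting at `from` in `text`, or -1.
function urlEnd(text: string, from: number): number {
  const m = urlBody.exec(text.slice(from))
  return m ? from + m[0].length : -1
}

// The match covers "https://google.com/" and leaves "_italic_" untouched.
const end = urlEnd("https://google.com/_italic_", 0)
```

Whether this is desirable depends on the links you expect; keeping underscores inside the URL is usually the safer default.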

bxff · May 19, 2022, 8:44pm

That’s the problem… Isn’t the before property supposed to make such parsing happen before emphasis, and make emphasis skip such ranges? I say this because in markdown.ts, InlineCode is handled in such a manner that no parsing is done inside of InlineCode. If this is not the case, what exactly is the point of the before and after properties?

I also added support for WikiLinks ([[WikiLink]]), and saw that the parser's pos skips the delimiters, but not the element between the delimiters. That is why I thought that once the position of the parsed element was returned, the parser wouldn’t go through that range, and that the parser position would determine which ranges are visited. I also saw that before adding the WikiLink extension, the parser recognized the text inside the brackets as a link, but after adding it, the link is no longer parsed. That would make sense when the WikiLink parser runs before Link; and when I specify after: "LinkEnd", the WikiLink parser doesn’t work, which is as intended if that is how before and after should work, i.e. parse something before the other and skip ranges.

marijn · May 20, 2022, 6:12am

No, the inline parser moves forward over the input one character at a time, and at each point, the inline parsers are allowed to recognize a token in order of precedence. Those tokens can consume text that would be handled by other parsers.
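
The loop described here can be illustrated with a toy scanner. This is a simplified model written for this thread, not the library's implementation; `scanInline`, `matchAt`, `toyUrl` and `toyEmphasis` are hypothetical names, and the two patterns are rough stand-ins for real URL and emphasis rules.

```typescript
// A toy inline parser: given the text and a position, return the end of the
// token it recognizes there, or -1 to decline.
type ToyParser = { name: string; parse(text: string, pos: number): number }
type ToyToken = { name: string; from: number; to: number }

function scanInline(text: string, parsers: ToyParser[]): ToyToken[] {
  const tokens: ToyToken[] = []
  let pos = 0
  outer: while (pos < text.length) {
    // At each position, offer the text to the parsers in precedence order.
    for (const p of parsers) {
      const end = p.parse(text, pos)
      if (end > pos) {
        // The first parser that accepts wins and the scan jumps to its end,
        // so nothing inside the token is offered to any other parser.
        tokens.push({ name: p.name, from: pos, to: end })
        pos = end
        continue outer
      }
    }
    pos++ // nobody matched here: move on by one character
  }
  return tokens
}

// Builds a parse function that matches `re` (anchored with ^) at `pos`.
const matchAt = (re: RegExp) => (text: string, pos: number): number => {
  const m = re.exec(text.slice(pos))
  return m ? pos + m[0].length : -1
}

const toyUrl: ToyParser = { name: "HTTPLink", parse: matchAt(/^https?:\/\/\S+/) }
const toyEmphasis: ToyParser = { name: "Emphasis", parse: matchAt(/^_[^_\s]+_/) }

const sample = "https://google.com/_italic_"
const urlFirst = scanInline(sample, [toyUrl, toyEmphasis])      // one HTTPLink, 0 to 27
const emphasisFirst = scanInline(sample, [toyEmphasis, toyUrl]) // same result
const emphasisOnly = scanInline(sample, [toyEmphasis])          // one Emphasis, 19 to 27
```

In this model, precedence only matters when two parsers could match at the same position. The URL starts earlier than the underscores, so it swallows them regardless of order, provided the URL parser actually returns its end position. In the snippet posted earlier in the thread, the `return` inside the `forEach` callback only exits the callback, so `parse` itself always returns -1; that would behave like the `emphasisOnly` case and may explain why the italics were still parsed.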