Is my text format appropriate to be parsed with Lezer?

I want to add syntax highlighting for my text format to CodeMirror, but I am inexperienced with parsing. I saw that the Markdown language repo mentions that Lezer isn’t appropriate for parsing Markdown. Before I try to figure out Lezer, I’m wondering if you think my format is suitable to be parsed that way, or what other technique I might try.

This is the format used in my product, tasktxt.com. I hope to migrate its editor to CodeMirror.

My format is like this:

A line preceded by two blank lines (or the start of the document) is considered a “task”. Any lines following this line are “notes” for that task, unless there are two blank lines, in which case the line after that is a separate task. The notes are just text, but the first line of a task has several items that should be parsed within it.

this is the first task
these are notes about the first task

these are notes about the first task


[x] this is the second task


this is the third task 10m
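
To make the task/notes rule concrete, here is a minimal sketch in TypeScript of how a document could be split before any per-line parsing. The names `Task` and `splitTasks` are made up for illustration. It assumes that "two blank lines" means two or more, that the first non-blank line of the document always starts a task, and that blank lines inside notes can be dropped.

```typescript
// Hypothetical shape for illustration; not part of the original format description.
interface Task {
  title: string;   // first line of the task, still unparsed
  notes: string[]; // non-blank note lines that follow it
}

function splitTasks(doc: string): Task[] {
  const tasks: Task[] = [];
  let blankRun = 0; // number of blank lines seen since the last non-blank line

  for (const line of doc.split(/\r?\n/)) {
    if (line.trim() === "") {
      blankRun++;
      continue;
    }
    const current = tasks[tasks.length - 1];
    // Assumption: the first non-blank line, or any line after 2+ blank lines, starts a task.
    if (current === undefined || blankRun >= 2) {
      tasks.push({ title: line, notes: [] });
    } else {
      current.notes.push(line);
    }
    blankRun = 0;
  }
  return tasks;
}
```

Run on the example above, this should give three tasks, the first of which carries two note lines.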

The first line of the task has these components:

  • an optional ‘checkbox’ [x]
  • any text
  • an optional guess duration in the format 5m 30s
  • a divider (/) followed by an optional duration and an optional timestamp, e.g. / 1m [10:00:00am]

The tricky thing for me is that the “text” portion can have things that look like durations or timestamps within it. For example:

this is some text 5m / 1m [10:00:00am] more text 10m / 20s [12:30:00am]

In that example, the portion that says 5m / 1m [10:00:00am] is part of the plain old text and is not parsed, but the later 10m / 20s [12:30:00am] is meant to be parsed.
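
One way to get this "only the trailing part counts" behaviour outside of Lezer is a single regular expression anchored at the end of the line, with a lazy match for the text. The sketch below is TypeScript and is only an illustration: `TaskLine` and `parseTaskLine` are made-up names, and it assumes durations use the units h, m and s, checkboxes are `[x]` or `[ ]`, and timestamps look exactly like `[10:00:00am]`.

```typescript
// Hypothetical result shape for illustration.
interface TaskLine {
  checkbox?: string;  // "[x]" or "[ ]" (assumed spellings)
  text: string;
  guess?: string;     // e.g. "5m 30s"
  divider: boolean;   // true when a trailing "/" section was recognised
  duration?: string;  // e.g. "1m"
  timestamp?: string; // e.g. "[10:00:00am]"
}

// Assumed sub-formats; adjust to the real rules of the format.
const DUR = /\d+[hms](?:\s+\d+[hms])*/.source;
const TS = /\[\d{1,2}:\d{2}:\d{2}[ap]m\]/.source;

// The lazy (.*?) keeps the text as short as possible, so the optional
// trailing fields, which must reach the end of the line, win whenever they can.
const TASK_LINE = new RegExp(
  "^(?:(\\[[ x]\\])\\s+)?(.*?)" +
    "(?:\\s+(" + DUR + "))?" +
    "(?:\\s*(/)(?:\\s*(" + DUR + "))?(?:\\s*(" + TS + "))?)?" +
    "\\s*$"
);

function parseTaskLine(line: string): TaskLine {
  const m = TASK_LINE.exec(line);
  // Every field is optional, so a single line should always match; this is a guard.
  if (m === null) return { text: line, divider: false };
  return {
    checkbox: m[1],
    text: m[2],
    guess: m[3],
    divider: m[4] !== undefined,
    duration: m[5],
    timestamp: m[6],
  };
}
```

For the example line above, this should report the text as everything up to and including "more text", with guess 10m, duration 20s and timestamp [12:30:00am]. Note that this sketch drops the whitespace between the text and the trailing fields.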

I hope that is clear enough and I’d be very grateful for any pointers on this, thank you.

I think that language should be parsable with an LR grammar, yes.

Thanks, that is encouraging. I’ve been reading and re-reading the docs, but I’m very new to this. Maybe as a starting point someone can point me in the right direction for parsing these examples?

I’m playing around with writing a lezer grammar, but I really don’t know what techniques I should be looking at to handle these situations where the slash must appear at the end, and prior slashes are ignored.

Furthermore, anything appearing after the slash would need to follow a specific format, and if it fails to match that format, the entire line should be interpreted as ‘text’.

hello world /
which should be parsed as

text: 'hello world '
divider: '/'

hello / world /
which should be parsed as

text: 'hello / world '
divider: '/'
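
For what it's worth, here is a rough sketch in plain TypeScript (not a Lezer grammar) of the "last slash wins, otherwise everything is text" rule for the two examples above. `SplitLine`, `TAIL` and `splitAtLastDivider` are made-up names, and the tail format (an optional duration such as 1m, then an optional timestamp such as [10:00:00am]) is an assumption based on the first post.

```typescript
// Hypothetical result shape for illustration.
interface SplitLine {
  text: string;
  divider?: "/";
  tail?: string; // whatever follows the divider, trimmed
}

// Assumed format of what may follow the divider: optional duration, optional timestamp.
const TAIL =
  /^\s*(?:\d+[hms](?:\s+\d+[hms])*)?\s*(?:\[\d{1,2}:\d{2}:\d{2}[ap]m\])?\s*$/;

function splitAtLastDivider(line: string): SplitLine {
  // Only the last slash can be the divider; earlier ones stay in the text.
  const slash = line.lastIndexOf("/");
  if (slash === -1) return { text: line };

  const tail = line.slice(slash + 1);
  // If the part after the slash is not well-formed, the whole line is text.
  if (!TAIL.test(tail)) return { text: line };

  return { text: line.slice(0, slash), divider: "/", tail: tail.trim() };
}
```

With this, `splitAtLastDivider("hello / world /")` should return the text 'hello / world ' and a divider, while `splitAtLastDivider("hello / world")` returns the whole line as text because " world" is not a valid tail.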

Well, I ended up using the StreamLanguage/StreamParser technique, which is much easier for me to wrap my head around (as someone with little experience in parsing), and I got my system to work (which I’m thrilled about), so I’m satisfied, but would be interested to learn more if anyone comes across this and can provide an example to the above question.

Hi! I’m Kaz. We’re back to try using Lezer for our app.

Furthermore, anything appearing after the slash would need to follow a specific format, and if it fails to match that format, the entire line should be interpreted as ‘text’.

Does anybody know how to parse this case, or whether it is even possible with Lezer? Basically, I’d like to know how to handle the last occurrence of a particular character.

A format where the tokenizer needs to look ahead to figure out what kind of token it is seeing isn’t super suited for this type of parser system, but you can usually write somewhat convoluted external tokenizers to make it work anyway.
