Novice trying to write a highlighter. Stuck on disambiguation and token depth, possibly misunderstanding

Hi. I’d like to write a CM6 highlighter for my language. I’m new, and I might be misunderstanding.

I’m having some trouble. I think I understand what’s wrong, but I’m not confident of my understanding, and if I am right, I don’t know how to resolve it.

 

The Language

The basis of my problem is a detail of the language I’m trying to parse, FSL, which is heavily reliant on what lisps and prologs call atoms, or what perl calls barewords.

In this language, you can tell from context when something is a bareword, and when something is a language token. As a result, it’s actually perfectly fine to have barewords which appear to collide with language tokens.

In the language,

foo -> bar -> baz;

This is a chain of three barewords. We can tell because they’re separated by arrows, which aren’t used anywhere else. My parser will parse this as a Chain. Underneath are three Atoms and two Arrows, plus a Terminator (the semicolon).

This matters because language tokens tend to look like barewords. For example, you can write

state foo: { shape: circle; };

This is a StateDecl (declaration) filled with a StateDeclItem.

The language realizes from context that state is a keyword, not an atom’s name.

I wrote a Lezer grammar for the highlighter, and it parses this correctly. It kinda feels like PEG, which is nice. As a result, the grammar can parse this, which otherwise seems ambiguous:

state -> country;
state country: { label: "A nation"; };
state state: { label: "A state"; };
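To make the "keyword from context" idea concrete, here is a minimal hand-rolled sketch. It is not the FSL grammar and not how Lezer resolves this; the rule it encodes (a leading `state` followed by a name and a colon is a keyword) and the function name `leadingWordKind` are assumptions made up for illustration, and hyphenated atoms are ignored for brevity.

```javascript
// Hypothetical illustration only: decide whether the first word of a
// statement is the `state` keyword or an ordinary atom, based on what follows.
function leadingWordKind(line) {
  // Simplified tokens: arrows, punctuation, quoted strings, hyphen-free words.
  const tokens = line.match(/->|[:;{}]|"[^"]*"|[A-Za-z_]\w*/g) || [];
  const isWord = (t) => t !== undefined && /^[A-Za-z_]\w*$/.test(t);
  // Assumed rule: `state <name> :` starts a declaration, so `state` is a keyword.
  const isDecl = tokens[0] === "state" && isWord(tokens[1]) && tokens[2] === ":";
  return isDecl ? "keyword" : "atom";
}
```

Under that assumed rule, the leading `state` in `state -> country;` comes out as an atom, while the leading `state` in `state state: { label: "A state"; };` comes out as a keyword.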

 

The Problem

The problem is, what I receive in the editor is a thing that’s marked Chain, and another marked StateDecl. I don’t seem to be able to highlight the Atoms or the Arrows inside of the chain, or the StateDeclItem inside the StateDecl.

What I want instead is to reference the Arrows, Strings, Numbers, Atoms, and so on. Almost all my rules are compound in this way, so I don’t entirely know how to move forward.

As a result, I’m only able to meaningfully highlight the rules which are complete as their top-level expression, like line comments and flow declarations.

The other problem is that I’m having trouble with ambiguity resolution. If I try to promote the subordinate rules such that they’re exposed, the atoms and arrows collide with one another, because in some ways they have overlapping character sets. foo-bar is a valid atom, so the arrow -> collides on grounds of the hyphen, apparently.
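As a rough illustration of how a hyphen can make atoms and arrows overlap, the sketch below compares two hypothetical atom patterns written as JavaScript regular expressions. Neither pattern is taken from the FSL grammar, and Lezer token rules are not JavaScript regexes; the point is only the shape of the rule. `greedyAtom` allows a hyphen anywhere after the first letter, while `strictAtom` only allows a hyphen between letters.

```javascript
// Assumption: both patterns are invented for this example.
const greedyAtom = /^[A-Za-z][A-Za-z-]*/;       // hyphen allowed anywhere after the first letter
const strictAtom = /^[A-Za-z]+(?:-[A-Za-z]+)*/; // hyphen must be followed by more letters

// Return the atom matched at the start of the input, or null if there is none.
function leadingAtom(pattern, input) {
  const m = pattern.exec(input);
  return m ? m[0] : null;
}
```

With the input `foo->bar`, the greedy pattern takes `foo-` and leaves a stray `>`, whereas the strict pattern stops at `foo` and leaves `->` intact. Both patterns still accept `foo-bar` as a single atom.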

 

The Ask

What I want is to be able to say “style the things inside the top level rule you gave me, instead of the top level rule itself.” Is that possible? If so, this all goes away.

If not, the alternative would be “is there a way to modify my grammar to make these sub-rules top-level, without running afoul of ambiguity?” But I want this grammar shape, and if I can just highlight the sub-rules somehow, I’d much prefer that.

 

Reference

In case they’re relevant:

In the live editor, you’ll see that a Chain gets highlighted, but the individual atoms and arrows inside are not; similarly, the Chain has a DOM representation but its sub-elements do not.

A valid simple chain for the editor is

foo -> bar;

The ideal mock-dom for that would be (please ignore the dumb indentation, just forum formatting)

<chain>
  <atom>foo</atom> 
  <arrow>-&gt;</arrow> 
  <atom>bar</atom><terminator>;</terminator>
</chain>

You could do something like styleTags({"Chain/Arrow": someTag}) to style arrows inside a chain node, if that’s what you’re asking.

Well, it seems like that really ought to be what I’m asking about.

Still, the DOM isn’t marked up, and I can’t get rules of that shape to trigger.

A trivial case:

https://stonecypher.github.io/codemirror-lang-fsl/

If you fill this with a => "b"; you will get an unmarked row.

The editor currently has a rule for "Chain/Arrow" which should mark it as a lineComment, because that default red-brown is easy to see.

Arrow is slot 5 on Chain, on line 50 of syntax.grammar in the StoneCypher/codemirror-lang-fsl repository on GitHub (main branch).

As you can see, there is only one parsing of Chain, and it requires Arrow unambiguously at least once.

As a diagnostic, three lines above the Chain/Arrow rule, there is a (currently commented out) rule for just Chain. It also sets lineComment as a color, and if you un-comment this rule and recompile, the Chains do highlight. Therefore, I believe it is actually matching here.

So.

I believe that the thing you’re suggesting is correct. However, I’m doing something wrong, and I’m not getting the expected result.

If I inspect that editor’s syntax tree it gives Program(Chain), so it seems the problem is with your parser not emitting the nodes you are expecting.

Okay. There’s definitely something I don’t understand happening here.


I went back and I made a trivial proof of concept grammar.

@top Program { expression* }

expression {
  FooBar |
  Foo    |
  Bar
}

@tokens {
  Foo    { "foo" }
  Bar    { "bar" }
  FooBar { Foo Bar }
}

Next, I gave it a syntax coloring where Bar would be brown, whether or not it was part of FooBar.

import {styleTags, tags as t} from "@lezer/highlight"

styleTags({
  "FooBar/Bar" : t.lineComment,
  Bar          : t.lineComment,
  "( )"        : t.paren
})

Lastly, I tested it. All passing.

# Foo

foo

==>

Program(Foo)

# Bar

bar

==>

Program(Bar)

# FooBar

foobar

==>

Program(FooBar)

Tried it in the editor.

bar on its own highlights, but bar as part of the text foobar does not. Similarly, bar gets a DOM node, but foobar does not.

Weirder still, fbar highlights bar. It’s not until you type foo out that it fails. This suggests that FooBar is blocking Bar somehow.


So I try adding a rule for FooBar, right? Without /Bar? No: that one also won’t highlight.

styleTags({
  "FooBar"     : t.lineComment,
  "FooBar/Bar" : t.lineComment,
  Bar          : t.lineComment,
  "( )"        : t.paren
})

Under this, Bar will highlight, but FooBar and FooBar/Bar won’t, even though the test says it parses correctly.



If you want to see the facile version, it’s here


Is there a way to get Lezer to just tell me how it’s parsing something?

For the life of me I can’t see how this could parse any other way, but I did make a trivial foobar grammar built from foo and bar, and it was able to see bar underneath

This would be a lot easier to understand if I could just see how these Chains were being interpreted

A token rule like FooBar { Foo Bar } will cause a token to be created for that input, which is an atomic element in the parse process. Maybe you intended for this rule to be a regular production, outside the @tokens block?

The goal was to get Bar to highlight, because that’s what doesn’t work in my real highlighter. Unfortunately that’s the only thing that works in the toy.

oh, wait.

wait, so, the way i’ve been coming at this was with the mindset of a peg person. the tokens, i thought, were the various things in the tree that could be expressed

but that’s not correct, is it? the tokens are just the things that are exported as noteworthy to the editor / highlighter, and the majority of the structure is retained as internal to lezer, rather than exported to the other tools

also, an exported token can’t contain a different token. my expectations about hierarchical embedding are wrong. we don’t paint branches, only leaves. and thus, the “Foo/Bar” syntax isn’t a convenient override for leaf-over-branch, but rather, a necessary disambiguator for leaf-of-branch-of-kind.

i thought i could paint a rule, then paint over the pieces inside. and that isn’t correct, and that’s why i was stuck.

is this mental pivot correct?

Well, regular productions with capitalized names are available in the tree too.

It can refer to other tokens, and use them as a subroutine, but a token doesn’t contain anything, it’s a leaf in the syntax tree.
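To make the leaf point concrete, here is a simplified model of why a selector like "FooBar/Bar" has nothing to match when FooBar is a token. This is not the @lezer/highlight implementation; the tree objects and the `paths` and `matches` helpers are invented for illustration. `tokenTree` models the Program(FooBar) result from the tests above, and `ruleTree` models what the tree could look like if FooBar were a regular production containing Foo and Bar.

```javascript
// Simplified model only: plain objects stand in for syntax tree nodes.
const tokenTree = { name: "Program", children: [{ name: "FooBar", children: [] }] };
const ruleTree = {
  name: "Program",
  children: [
    {
      name: "FooBar",
      children: [
        { name: "Foo", children: [] },
        { name: "Bar", children: [] },
      ],
    },
  ],
};

// Collect the slash-joined ancestor path of every node in the tree.
function paths(node, prefix = []) {
  const here = [...prefix, node.name];
  return [here.join("/"), ...node.children.flatMap((c) => paths(c, here))];
}

// Modeled rule: a selector such as "FooBar/Bar" matches a node whose path ends with it.
function matches(selector, tree) {
  return paths(tree).filter((p) => p === selector || p.endsWith("/" + selector));
}
```

In this model, `matches("FooBar/Bar", tokenTree)` is empty because the token is a leaf with no Bar beneath it, while `matches("FooBar/Bar", ruleTree)` finds the nested Bar node.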

that answer really doesn’t clarify what i’m actually asking about

i’m trying to figure out why FooBar doesn’t allow Bar to highlight anymore

all I really have to go on is an example that could be read several different incompatible ways

i would really appreciate some tractable help. i went as far as to make a reduced example but it fails in exactly the opposite way as my real grammar and i can’t figure out why

is there some way, please, for me to see how lezer parses something

the tests give me results that appear to suggest success but you said “it’s probably not parsing the way you think it is”

i don’t know how to see how it’s parsing
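For reference, one way to see how something was parsed is to walk the resulting tree and print every node with its range. The sketch below assumes the tree exposes `iterate({enter, leave})` with callbacks that receive an object carrying `name`, `from`, and `to`, which is the shape of `Tree.iterate` in @lezer/common 1.x; older versions passed different arguments. The helper name `dumpTree` and the `parser` and `text` names in the usage comment are assumptions. A Lezer `Tree` also has a `toString()` method that gives the compact form used in the test files, such as Program(Chain).

```javascript
// Print one line per node, indented by depth, with its range and source text.
// Works with any object whose iterate() calls enter/leave with {name, from, to}.
function dumpTree(tree, source) {
  const lines = [];
  let depth = 0;
  tree.iterate({
    enter(node) {
      const text = JSON.stringify(source.slice(node.from, node.to));
      lines.push(`${"  ".repeat(depth)}${node.name} [${node.from}..${node.to}] ${text}`);
      depth++;
    },
    leave() {
      depth--;
    },
  });
  return lines.join("\n");
}

// Hypothetical usage, assuming `parser` is the parser generated from the grammar:
//   console.log(dumpTree(parser.parse(text), text));
```

If a node you expect, such as Arrow or Bar, never shows up in this dump, no style rule that names it can apply.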