Ambiguity between objects and block of statements

kearfy · March 30, 2024, 4:28pm

Heya! I’ve been struggling to differentiate between JSON-like objects and blocks of statements, which are conveniently wrapped in the same {} brackets, for SurrealQL (SurrealQL | SurrealDB Docs). I have it almost figured out, but the way Lezer will (in some cases) prioritise global tokens over what is positionally possible makes this difficult to nail down.

SurrealQL is an SQL-like language, meaning we have Idents (identifiers, like fields on a table). I use this as a basic building block like:

rawident {
		(@asciiLetter | "_") (@asciiLetter | @digit | "_")+ |
		@digit+ (@asciiLetter | "_") (@asciiLetter | @digit | "_")+
}

These rawidents serve as the base for other rules, and are extended into the language’s keywords with:

@external extend { rawident } tokens from "./tokens" {
	select [@name=Keyword],
}

Combined with JS logic:

import { select } from "./parser.terms";
const tokenMap = { select };
export const tokens = function(t) {
	return tokenMap[t.toLowerCase()] ?? -1;
}
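One detail that may be worth guarding against in a lookup like this: a plain object literal inherits keys such as constructor from Object.prototype, so an identifier with that name would not fall through to -1. A minimal sketch of a prototype-free variant follows. The term id 1 is a stand-in for whatever the generated ./parser.terms exports, and the export keyword is left out so the snippet runs on its own.

```javascript
// Stand-in for the term id exported by the generated "./parser.terms";
// the real number is assigned by the Lezer generator (assumption: 1).
const select = 1;

// Object.create(null) gives a map without inherited keys like "constructor".
const tokenMap = Object.assign(Object.create(null), { select });

// Maps the text of a rawident to a keyword term id, or -1 when the
// text is not a keyword (the same contract as the function above).
const tokens = (text) => tokenMap[text.toLowerCase()] ?? -1;
```

With this variant, tokens("SELECT") and tokens("select") both give the select term, while tokens("test") and tokens("constructor") give -1.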

And as for the rules that build on rawident:

Ident {
		rawident |
		"`" ![`]+ '`'
}
ObjectKey {
		( rawident | singleString | doubleString ) ":"
}

At this point, because of the combination of the rawident and :, I can prioritise ObjectKey over a normal ident. Nice, the separation between objects and blocks works! Almost…

In SurrealQL, one of the building blocks is the Record ID, formatted like table:unique_id, with table being an ident and unique_id being a number, string, array or object. ObjectKey is only used in the context of objects, so after a {. In that scenario the “ident :” syntax should take precedence ({ person:123 } will be an object with property person), but in any other scenario Record ID syntax should take precedence.

I can see how this post lacks some context. I’m not sure what’s most important to provide for you to get an idea of what I’m trying to do and how I’m trying to solve it. Please let me know what additional information you need. I can also post the full grammar I have thus far, but it’s getting quite lengthy due to the number of statements.

Thanks and Happy Easter!

kearfy · March 30, 2024, 4:34pm

One additional thing which ties into my whole Ident structure: I have a select statement like

SelectStatement {
	(select)
	(
		value Predicate |
		commaSep<InclusivePredicate>
	)
	(from)
	(only)?
	(
		NestableStatement |

		commaSep<Value>
		WithClause?
		WhereClause?
		SplitClause?
		GroupClause?
		// TODO order by clause
		LimitStartComboClause?
		FetchClause?
		TimeoutClause?
		ParallelClause?
		ExplainClause?
	)
}

Here Ident is part of Value, but a statement like SELECT * FROM test causes the test Ident to throw an error. This worked at some point, so I’m not sure when it broke. It’s unfortunate that Lezer doesn’t give much context on what goes wrong, and I’m running out of ideas.

Thanks again!

kearfy · April 3, 2024, 7:17am

@marijn I guess my question is whether it is possible to have contextual precedence, so that rawident ":" would only have precedence over a plain ident right after a { token, making it possible to decide whether something is a block or an object. I have a feeling something like this is possible with external logic, but the only way I could see there requires writing the parsing for an entire block/object in JS, which seems kind of pointless. Thanks!

marijn · April 3, 2024, 12:01pm

Precedences are static. You can make it so that conflicting tokens don’t occur in the same positions in the grammar, and contextual tokenization will make sure the right one is picked, but I’m not sure that applies here. Another approach would be an external tokenizer that somehow (possibly via stack.canShift) figures out which token is appropriate.
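To make the external-tokenizer idea a bit more concrete, here is a rough sketch of only the lookahead part, written as a plain function so it can be tried without the parser. It is not taken from the grammar in this thread. The helper names are hypothetical, the identifier check is looser than rawident, and it assumes that {} counts as an empty object and that ident:: (as in function paths) is not an object key.

```javascript
// `peek(offset)` returns the character code `offset` characters past the "{",
// or -1 at end of input (similar in shape to Lezer's InputStream.peek).
const isSpace = (c) => c === 32 || c === 9 || c === 10 || c === 13;
const isDigit = (c) => c >= 48 && c <= 57;
const isWordStart = (c) =>
  (c >= 65 && c <= 90) || (c >= 97 && c <= 122) || c === 95;
const isWordChar = (c) => isWordStart(c) || isDigit(c);

function skipSpace(peek, i) {
  while (isSpace(peek(i))) i++;
  return i;
}

// Hypothetical helper: true when the text after "{" starts like an object,
// i.e. an identifier or quoted string followed by a single ":".
function looksLikeObject(peek) {
  let i = skipSpace(peek, 0);
  const first = peek(i);
  // Assumption: "{}" is an empty object rather than an empty block.
  if (first === 125 /* } */) return true;
  if (first === 34 /* " */ || first === 39 /* ' */) {
    // Quoted key: scan to the matching quote, skipping backslash escapes.
    i++;
    for (;;) {
      const c = peek(i);
      if (c < 0) return false; // unterminated string
      if (c === 92 /* \ */) { i += 2; continue; }
      i++;
      if (c === first) break;
    }
  } else if (isWordChar(first)) {
    // Simplification: any run of word characters counts as a key here.
    while (isWordChar(peek(i))) i++;
  } else {
    return false;
  }
  const j = skipSpace(peek, i);
  // Assumption: "::" is a path separator, so it does not mark a key.
  return peek(j) === 58 /* : */ && peek(j + 1) !== 58;
}

// Small adapter so the helper can be tried on plain strings.
const peekFrom = (text) => (offset) =>
  offset < text.length ? text.charCodeAt(offset) : -1;
```

Inside an ExternalTokenizer from @lezer/lr, a function like this could be called as looksLikeObject((o) => input.peek(o + 1)) when input.next is {, followed by input.advance() and input.acceptToken(...) with a hypothetical objectOpen or blockOpen term declared in the grammar's @external tokens block. The tokenizer only emits that one brace token. Everything after it is parsed by the ordinary grammar rules, so the object or block itself does not need to be parsed in JS.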

kearfy · April 3, 2024, 12:03pm

Interesting. How do you let the external tokenizer hand back to “the normal grammar/parser”, as it were? Basically you’d want the external tokenizer to decide whether something is a block or an object, but I’m stuck on how to hand control back.