Why does parse tree contain nodes with negative ranges when there is an error?

shche · October 3, 2023, 2:20am

Hello,
The code below when given an invalid input outputs node(s) with negative ranges. I could not find any documentation on this. Would you please explain this behaviour?

@top Top { Expr }
@skip { space }

Symbol { "a" }
Expr { "(" Symbol ")" }

@tokens {
  space { " "+ }
}

The code:

import {parser} from './lang';

const input = '( )';
const tree = parser.parse(input);

console.log(input, '\n');

let cursor = tree.cursor();

do {
    const {name, type, from, to} = cursor;
    const caption = type.isError ? 'ERROR' : 'NODE';

    console.log(`${caption}: ${name} [${from}-${to}]`);
} while (cursor.next());

The output:

( ) 

NODE: Top [0-3]
NODE: Expr [0-3]
NODE: Symbol [2-1]
ERROR: ⚠ [2-2]

In addition, when I make the Expr non-terminal lower-case, the output changes:

( ) 

NODE: Top [0-3]
NODE: Symbol [2-65537]
ERROR: ⚠ [2-2]

I know that capitalization is used to make non-terminals visible in the tree, but why does it affect the ranges?

Thank you in advance!
konstantin

marijn · October 3, 2023, 7:57am

That was not intentional behavior. Attached patch should fix it.

napsta32 · October 3, 2023, 5:26pm

I was checking this fix and I see that we have now zero-length tokens:

@top Top { expr }
@skip { Space }

Symbol1 { "a" }
Symbol2 { "b" }
expr { "(" ( Symbol1 | Symbol2 ) ")" }

@tokens {
  Space { " "+ }
}

Input: ( )

NODE: Top [0-3]
NODE: Expr [0-3]
NODE: Symbol1 [2-2]
ERROR: ⚠ [2-2]

It is also a bit odd that “Symbol1” appears in this example (since both “Symbol1” and “Symbol2” are equally valid candidates). I can understand that this is part of the expected behavior though.

napsta32 · October 3, 2023, 8:00pm

@marijn it would be helpful if you could clarify why these extra nodes appear. We see that they are not error nodes, they are just regular nodes. Should we identify them by checking if their range is zero or not (the range will always be zero for these “extra” nodes)?

marijn · October 3, 2023, 9:04pm

The parser moved past the parse error by inserting a token (marked as an error node). In your example, that node was an "a", allowing it to reduce a symbol and continue its expression.

shche · October 3, 2023, 9:10pm

Thanks.

When there is an error, in addition to inserting new tokens, can Lezer do deletions or replacement of existing tokens to be able to resynchronize?
How can we tell apart nodes that resulted from tokens that were inserted/deleted/changed by the parser and nodes that were generated from the actual input on successful shift/reduce operations?

konstantin

marijn · October 3, 2023, 9:31pm

There are 4 things the parser may do when it cannot parse normally: insert a made-up token, delete the next token, force the current rule to end (skipping out of it), or restarting the entire parse (when it has hit the parse end state).

Nodes that contain inserted tokens are not marked specially except that they will contain an error node.

shche · October 3, 2023, 9:42pm

When a token is deleted, will we also get an error node? Will there also be another node pointing to the deleted input?

marijn · October 3, 2023, 10:11pm

The deleted token will be wrapped in an error node.