Global position for external tokenizers

napsta32 · October 13, 2023, 1:16am

From reading the documentation, I understand that the InputStream pos is not global and is relative to the fragment the parser is parsing (Lezer Reference Manual). Is this correct?
Is it possible to obtain the global position (relative to the complete input) to avoid having to work with this interface?

marijn · October 13, 2023, 7:02am

Why do you want the global position? The amount of things you can do in a tokenizer without breaking incremental re-parsing is rather limited, and the input stream abstraction tries to prevent you from going wrong there.

napsta32 · October 17, 2023, 10:23pm

I was wondering because we have a complex tokenizer that we would prefer not to modify much. It takes plain JavaScript strings as input, and this new interface changes things.

We also use input.substring(start) (copying all of the input from start onward), which is not efficient but is part of our current tokenizer. We would prefer to use plain strings instead of the interface (the InputStream will force us to write a while loop with String.fromCharCode()).
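
For reference, the loop described above might look roughly like the sketch below. It assumes only the stream members the Lezer documentation describes (pos, next, peek, advance); FakeStream and readAhead are hypothetical names made up for this illustration, and FakeStream is just a stand-in so the snippet runs on its own. In a real external tokenizer the stream would come from @lezer/lr.

```javascript
// Hypothetical stand-in for the parts of Lezer's InputStream used here
// (pos, next, peek, advance). Not the real class.
class FakeStream {
  constructor(text, pos = 0) {
    this.text = text;
    this.pos = pos;
  }
  // Code unit at the current position, or -1 at the end of the input.
  get next() {
    return this.peek(0);
  }
  // Code unit at pos + offset, or -1 when that falls outside the input.
  peek(offset) {
    const i = this.pos + offset;
    return i >= 0 && i < this.text.length ? this.text.charCodeAt(i) : -1;
  }
  // Move forward n code units and return the new current code unit.
  advance(n = 1) {
    this.pos = Math.min(this.text.length, this.pos + n);
    return this.next;
  }
}

// Collect up to maxLength code units ahead of the current position
// into a plain string, using peek so the stream itself is not advanced.
function readAhead(input, maxLength) {
  const codes = [];
  for (let i = 0; i < maxLength; i++) {
    const code = input.peek(i);
    if (code < 0) break; // end of input
    codes.push(code);
  }
  return String.fromCharCode(...codes);
}

const stream = new FakeStream("let x = 1;", 4);
const ahead = readAhead(stream, 3); // plain string starting at stream.pos
```

Keeping maxLength small seems wise: spreading a very large array into String.fromCharCode can hit argument limits, and looking far ahead is presumably the kind of thing that makes incremental re-parsing less effective.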

marijn · October 18, 2023, 6:32am

InputStream.pos does refer to a global position in the whole input. It sounds like the docs you linked explicitly mention this.

shche · December 6, 2023, 6:49pm

Hello Marijn, could you explain why the InputStream interface includes pos at all? What would be a possible use for it, given that we do not have the original input, and the only way to get anything from the input is to use peek and advance?
thanks
konstantin

marijn · December 7, 2023, 10:46am

The XML mode uses it as a cache key to avoid re-reading tag names, and some other external tokenizers use it to check whether they’ve moved the input stream, but indeed, it’s not super useful.
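
As a rough illustration of the cache-key use mentioned above, here is a minimal sketch. It is not the XML mode's actual code; makeStream, isNameChar, nameAt and scanCount are hypothetical names for this example, and the stream is a stand-in exposing only pos and peek. The cache is keyed on both the stream object and its pos, on the assumption that a position alone is not enough once a different input is being parsed.

```javascript
// Hypothetical stand-in exposing only `pos` and `peek`, so the
// snippet runs without @lezer/lr.
function makeStream(text, pos) {
  return {
    pos,
    peek(offset) {
      const i = this.pos + offset;
      return i >= 0 && i < text.length ? text.charCodeAt(i) : -1;
    },
  };
}

// Simplified notion of a name character, for illustration only.
function isNameChar(code) {
  return (
    (code >= 97 && code <= 122) || // a-z
    (code >= 65 && code <= 90) || // A-Z
    (code >= 48 && code <= 57) || // 0-9
    code === 45 || // -
    code === 95 // _
  );
}

let cachedInput = null;
let cachedPos = -1;
let cachedName = "";
let scanCount = 0; // counts real scans, only to make the caching visible

// Read the name starting at the stream position, reusing the previous
// result when called again for the same stream and the same pos.
function nameAt(input) {
  if (input === cachedInput && input.pos === cachedPos) return cachedName;
  scanCount++;
  let name = "";
  for (let i = 0; isNameChar(input.peek(i)); i++) {
    name += String.fromCharCode(input.peek(i));
  }
  cachedInput = input;
  cachedPos = input.pos;
  cachedName = name;
  return name;
}

const input = makeStream("<div class>", 1);
nameAt(input); // scans the input
nameAt(input); // same stream and pos, so the cached name is returned
```

The other use mentioned, checking whether the stream has moved, would presumably amount to saving input.pos before some work and comparing it afterwards.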