Understanding the python indent tracker hash function

tslocke · November 26, 2022, 11:00am

I am writing a lezer parser for a language with meaningful indentation. I’ve used the Python parser as a starting point. (GitHub - lezer-parser/python: A Python parser for Lezer)

I need to track a bit of extra context on top of the indent level, so I’ll need to update the hash function accordingly, and I’m confused by the current function. It 's calculated like this:

(parent ? parent.hash + parent.hash << 8 : 0) + depth + (depth << 4)

It looks to me like it’s shifting the parent’s hash up to higher bits, so that the bottom 8 bits are available for the current depth (so there is an assumption that depths will be <= 256, right?), but I would have expected just

(parent ? parent.hash + parent.hash << 8 : 0) + depth

Why is + (depth << 4) also needed?

marijn · November 26, 2022, 8:44pm

I am not very knowledgeable about hash functions, and just modeled this after an example I found online. There’s definitely no assumption of depths being less than anything, it’s just smashing together bits to get values that hopefully won’t clash.

tslocke · November 26, 2022, 10:33pm

If they do clash, is that a performance hit, or would the parser be buggy?

marijn · November 27, 2022, 12:13am

That could lead to an incorrect parse, I guess, if the stars all align in the right way and the clashing context ends up being a candidate for reuse exactly at the point where the clash happens.

superlou · January 22, 2024, 12:42pm

Is the hash to be unique for an indent level, or is it unique to a specific instance of an indent level? My understanding is that the hash is one way; you don’t decode information about the parent or level from it. Could a naive hash be like the following?

Level 0: hash = 0
    Level 1: hash = 1
        Level 2: hash = 2
Another Level 0: hash = 0
    Level 1: hash = 1

Or should it be like:

Level 0: hash = 0
    Level 1: hash = 1
        Level 2: hash = 2
Another Level 0: hash = 3
    Level 1: hash = 4

I might be completely missing the intent.

marijn · January 22, 2024, 12:56pm

It encodes the indentation level and any parent indentation levels.

superlou · January 22, 2024, 1:11pm

Hmm. But then does simply the hash value “2” for a Level 2 heading encode both its level and that there are two levels above it? Is the problem with such as simple approach the following?

Level 0.A: hash = 0
    Level 1.A: hash = 1
        Level 2.A: hash = 2
        Level 2.B: hash = 2 <-- Collision
    Level 1.B: hash = 1 <-- Collision
        Level 2.A: hash = 2

marijn · January 22, 2024, 2:29pm

Yes, if that is how the hash works, that would be bad. But it isn’t, as even the most cursory glance at the code would have showed you.

superlou · January 22, 2024, 6:21pm

Sorry, I understand the hashing algorithm provided was different than the very naive approach in my example.

I was trying to learn more about the design of the hash function, and since you had said you didn’t have information on that, I was trying to understand what IndentLevel represented and its hash-uniqueness needs. I’m not asking the question well, but does an IndentLevel need a document-unique hash, or is it “unique within parent,” or something like that? Since I’m working on parser that needs to combine whitespace and other elements for depth, I wanted to understand better what Lezer needed, and not explicitly how any given hash function was constructed.

That said, I’ll keep using the suggested hash function and will see if I hit any practical issues.

marijn · January 22, 2024, 8:35pm

It’s a hash, so it’s not guaranteed to be unique, but set up in such a way that it’s likely to be unique, for a set of indentation level. But since it only encodes a stack of depth levels, it’s going to be the same at different points of a document. And that’s fine. It is used to avoid reusing content that was parsed in a different indentation context, because that wouldn’t be valid, in an indentation-sensitive language.