Understanding the python indent tracker hash function

I am writing a lezer parser for a language with meaningful indentation. I’ve used the Python parser as a starting point. (GitHub - lezer-parser/python: A Python parser for Lezer)

I need to track a bit of extra context on top of the indent level, so I’ll need to update the hash function accordingly, and I’m confused by the current function. It 's calculated like this:

(parent ? parent.hash + parent.hash << 8 : 0) + depth + (depth << 4)

It looks to me like it’s shifting the parent’s hash up to higher bits, so that the bottom 8 bits are available for the current depth (so there is an assumption that depths will be <= 256, right?), but I would have expected just

(parent ? parent.hash + parent.hash << 8 : 0) + depth

Why is + (depth << 4) also needed?

I am not very knowledgeable about hash functions, and just modeled this after an example I found online. There’s definitely no assumption of depths being less than anything, it’s just smashing together bits to get values that hopefully won’t clash.

If they do clash, is that a performance hit, or would the parser be buggy?

That could lead to an incorrect parse, I guess, if the stars all align in the right way and the clashing context ends up being a candidate for reuse exactly at the point where the clash happens.

Is the hash to be unique for an indent level, or is it unique to a specific instance of an indent level? My understanding is that the hash is one way; you don’t decode information about the parent or level from it. Could a naive hash be like the following?

Level 0: hash = 0
    Level 1: hash = 1
        Level 2: hash = 2
Another Level 0: hash = 0
    Level 1: hash = 1

Or should it be like:

Level 0: hash = 0
    Level 1: hash = 1
        Level 2: hash = 2
Another Level 0: hash = 3
    Level 1: hash = 4

I might be completely missing the intent.

It encodes the indentation level and any parent indentation levels.

Hmm. But then does simply the hash value “2” for a Level 2 heading encode both its level and that there are two levels above it? Is the problem with such as simple approach the following?

Level 0.A: hash = 0
    Level 1.A: hash = 1
        Level 2.A: hash = 2
        Level 2.B: hash = 2 <-- Collision
    Level 1.B: hash = 1 <-- Collision
        Level 2.A: hash = 2

Yes, if that is how the hash works, that would be bad. But it isn’t, as even the most cursory glance at the code would have showed you.

Sorry, I understand the hashing algorithm provided was different than the very naive approach in my example.

I was trying to learn more about the design of the hash function, and since you had said you didn’t have information on that, I was trying to understand what IndentLevel represented and its hash-uniqueness needs. I’m not asking the question well, but does an IndentLevel need a document-unique hash, or is it “unique within parent,” or something like that? Since I’m working on parser that needs to combine whitespace and other elements for depth, I wanted to understand better what Lezer needed, and not explicitly how any given hash function was constructed.

That said, I’ll keep using the suggested hash function and will see if I hit any practical issues.

It’s a hash, so it’s not guaranteed to be unique, but set up in such a way that it’s likely to be unique, for a set of indentation level. But since it only encodes a stack of depth levels, it’s going to be the same at different points of a document. And that’s fine. It is used to avoid reusing content that was parsed in a different indentation context, because that wouldn’t be valid, in an indentation-sensitive language.