GetHashCode, a part of the Object class, is one of the key method present in every class instance. Its main purpose is to calculate and return a number associated with the specified object, which will be used as hash (a very good example is Dictionary class). This article will be split into two parts: the first one will focus on the reference types and the second one on the value types. We will focus on the internal implementation of CLR and try to figure out how exactly the hash code for reference and value types is generated.
Let’s start from the beginning: what characteristics should have a hash function to be useful for programmers. According to documentation of GetHashCode method, we can point three the most important:
Every class inherits (implicit or explicit) from Object class, which guarantees that you will have a set of methods (GetHashCode, Equals, GetType and ToString) available in every instance. What’s more, .NET provides a default implementation for all of them, which means that you can use them without overriding and making custom algorithms.
The theory shown above sounds cool, but things are going to be a little complicated. In .NET, we can distinguish two basic types which determine how an object behaves: reference and value types. Let’s talk about reference types first.
One of the popular myths says that hash code in reference types is directly related to its address in memory. It’s false! Yes, it doesn’t depend from the fields within instance due to performance and practical reasons - if you have Person
class with Age, then changing this member won’t change hash code unless you provide your implementation of GetHashCode. It’s really important to remember, that garbage collector not only deletes unused objects but also compacts rest of the entries reduce holes between them - which means that there is always a chance that object will change its address (C# provides a fixed keyword which locks object’s position in memory). In this situation, basing on this during generating hash code would be pointless.
Most of the articles don’t explain how exactly hash code is generated, and what formula is used - we will try to find out this. First, let’s check what is the implementation of RuntimeHelpers.GetHashCode method (used directly by default in Object.GetHashCode):
|
|
The MethodImplAttribute attribute indicates that the implementation can be found inside CLR - to find out how hash code is generated for reference types, we will dig into CLR source code. First, let’s check how the internal method is mapped into the C++ source. We can find it in /src/coreclr/src/vm/ecalllist.h file, which can be described as the bridge between C# internal calls and C++ functions. Let’s find our method:
|
|
This line indicates, that the GetHashCode
is mapped into ObjectNative::GetHashCode
function. We can find it in /src/coreclr/src/classlibnative/bcltype/objectnative.cpp:
|
|
This is quite a nice piece of code, so let’s stop for a moment. First, we need to know that every instance in .NET contains an internal field called “SyncBlock Index” - this is the index of the entry in internal CLR array, which contains things required for proper work in a multithread environment. By default, “SyncBlock Index” is set to -1 which means that the object isn’t used for synchronization. If we will use Monitor.Enter with some object as a parameter, then a proper entry in the SyncBlock list will be created and associated.
Now, look at the code above. First, the CLR reads the header of instance and checks (bits & BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX
) if the value of SyncBlock index targets entry in SyncBlock list or contains hashcode. The second one can be a bit misleading, but it’s a smart optimization which allows reducing memory usage by using this field to store hash code if there is no any SyncBlock associated with an object.
The second condition (bits & BIT_SBLK_IS_HASHCODE
) is more specific and checks if the SyncBlock index contains proper hash code. If yes, we can just return the value (because it means that something earlier generated and stored it). Otherwise, it means that there is already some SyncBlock entry created and associated with an object, so we can’t read or store hash code here. In this case, hash code is stored inside SyncBlock entry to avoid conflict and that’s why CLR reads entry using SyncBlock *psb = objRef->PassiveGetSyncBlock();
. Then, hash code is read with psb->GetHashCode()
and returned.
If the first condition (bits & BIT_SBLK_IS_HASH_OR_SYNCBLKINDEX
) is false, then it means that hash code doesn’t exist for the particular object and has to be generated. This is done by FC_INNER_RETURN(INT32, GetHashCodeHelper(objRef));
line, where we can spot the GetHashCodeHelper
call. Let’s see what’s inside:
|
|
This function is in fact just a objRef->GetHashCodeEx()
call - rest of the code is used by internal CLR mechanisms. We can found it in /src/vm/object.cpp:
|
|
This is in fact just an extended version of the previously described ObjectNative::GetHashCode, with more checks to ensure thread safety. The most interesting line is hashCode = ComputeHashCode();
, which calls the function to calculate new hash code - the part we are looking for.
|
|
Let’s anaylze what does hashCode = GetThread()->GetNewHashCode() >> (32-HASHCODE_BITS);
mean. First, we see that hash code is generated using the specified thread - which means that instance X will have different hash code depending on thread which generated it. Second, there is some misterious bit shift using HASHCODE_BITS
constant, defined in /src/vm/syncblk.h
|
|
Authors of CLR decided that 26-bit hash code will be even distribution, which is one of the key hash code features. GetNewHashCode
is a part of the Thread
class and can be found in /src/coreclr/src/vm/threads.h:
|
|
And here it is! Our final formula is quite simple - just a multiplication of thread ID with a random number (thanks to linear congruential generator). It’s very effective and fast algorithm to generate new hash codes, thus ensuring even distribution. m_dwHashCodeSeed
is initialized in /src/vm/threads.cpp and looks as follows:
|
|
The last unknown, dwHashCodeSeed
is a static variable that contains the constant seed for linear congruential generator.
|
|
Here, our short journey ends. We saw how a simple GetHashCode call runs whole CLR mechanisms to generate a hash code for a reference type. Next time, we will try to do the same thing but for value types - they are much more complex and use a set of algorithms to calculate hash code, depending on the object type or fields inside a structure.