Wherein I look at hashing and equality in Clojure.

To get started, it helps to have some familiarity with the interaction of equality with collections and numbers. There is an excellent Clojure guide titled Equality that you should take a look at. The summary might suffice. I won’t quote it here.

## A code-based approach

Perhaps the most useful fruitful approach would be to look at the code from the top down. Starting in the Clojure source, from `core.clj`:

``````;equiv-based
(defn =
"Equality. Returns true if x equals y, false if not. Same as
Java x.equals(y) except it also works for nil, and compares
numbers and collections in a type-independent manner.  Clojure's immutable data
structures define equals() (and thus =) as a value, not an identity,
comparison."
{:inline (fn [x y] `(. clojure.lang.Util equiv ~x ~y))
:inline-arities #{2}
([x] true)
([x y] (clojure.lang.Util/equiv x y))
([x y & more]
(if (clojure.lang.Util/equiv x y)
(if (next more)
(recur y (first more) (next more))
(clojure.lang.Util/equiv y (first more)))
false)))
``````

Of interest, this is followed by a commented out version that is based on a different method in `Util`

``````;equals-based
#_(defn =
"Equality. Returns true if x equals y, false if not. Same as Java
x.equals(y) except it also works for nil. Boxed numbers must have
same type. Clojure's immutable data structures define equals() (and
thus =) as a value, not an identity, comparison."
{:inline (fn [x y] `(. clojure.lang.Util equals ~x ~y))
:inline-arities #{2}
([x] true)
([x y] (clojure.lang.Util/equals x y))
([x y & more]
(if (= x y)
(if (next more)
(recur y (first more) (next more))
(= y (first more)))
false)))
``````

I leave you to contemplate the change in the comment. There are some clues toward the bottom of the Equality document.

TThe code difference is between calling `Util.equals` and `Util.equiv`. They didn’t just leave `=` alone and rewrite `Util.equals`. They wrote `Util.equiv` instead. As it turns out, `Util.equals` is still there and is used in Clojure code, in both Java/C# and Clojure.

So let’s take a look in `Util`. I’ll go with the C# version.

``````static public bool equiv(object k1, object k2)
{
if (k1 == k2)
return true;
if (k1 != null)
{
if (IsNumeric(k1) && IsNumeric(k2))
return Numbers.equal(k1, k2);

else if (k1 is IPersistentCollection || k2 is IPersistentCollection)
return pcequiv(k1, k2);
return k1.Equals(k2);
}
return false;
}

public static bool equals(object k1, object k2)
{
if (k1 == k2)
return true;
return k1 != null && k1.Equals(k2);
}
``````

Historical note: before the big change-over that led to the introduction of `equiv`, `equals` looked like this:

``````static public bool equals(object k1, object k2)
{
if (k1 == k2)
return true;
if (k1 != null)
{
if (IsNumeric(k1) && IsNumeric(k2))
return Numbers.equal(k1, k2);

return k1.Equals(k2);
}
return false;
``````

In other words, the older version of `equals` is just the new `equiv` without a special case for objects that are `IPersistentCollection`. The historical notes in Equality again give clues. You can conclude that a point had been reached where the `Equals` method on Clojure collections was no longer capable of doing the work.

Back to `equiv`. The assertions about equality made in the summary become obligations on `pcequiv` for Clojure collections and on our implementations of `Equals` for everything else.

`pcequiv` is straightforward:

``````public static bool pcequiv(object k1, object k2)
{
if (k1 is IPersistentCollection ipc1)
return ipc1.equiv(k2);
return ((IPersistentCollection)k2).equiv(k1);
}
``````

In other words, whichever argument is a `IPersistentCollection`, use its `equiv` method. So equality testing now has moved to the collection’s `equiv` versus its `Equals`. Let’s compare these for a typical collection. Let’s try `PersistentVector`; these methods are defined for it in its base class, `APersistentVector`. I’m going to switch the Java code here because there is an important point illustrated there that is not in the C# version (for reasons that will become apparent in a bit):

``````public boolean equals(Object obj){
if(obj == this)
return true;
return doEquals(this, obj);
}

public boolean equiv(Object obj){
if(obj == this)
return true;
return doEquiv(this, obj);
}
``````

So the entry points here handle reference equality and then delegate for content-based comparison. Starting with

``````static boolean doEquals(IPersistentVector v, Object obj){
if(obj instanceof IPersistentVector)
{
IPersistentVector ov = (IPersistentVector) obj;
if(ov.count() != v.count())
return false;
for(int i = 0;i< v.count();i++)
{
if(!Util.equals(v.nth(i), ov.nth(i)))
return false;
}
return true;
}
else if(obj instanceof List)
{
Collection ma = (Collection) obj;
if(ma.size() != v.count() || ma.hashCode() != v.hashCode())
return false;
for(Iterator i1 = ((List) v).iterator(), i2 = ma.iterator();
i1.hasNext();)
{
if(!Util.equals(i1.next(), i2.next()))
return false;
}
return true;
}
else
{
if(!(obj instanceof Sequential))
return false;
ISeq ms = RT.seq(obj);
for(int i = 0; i < v.count(); i++, ms = ms.next())
{
if(ms == null || !Util.equals(v.nth(i), ms.first()))
return false;
}
if(ms != null)
return false;
}

return true;

}
``````

Fairly straightforward. `doEquiv` looks very similar. We can take advantage that the first argument is known to be an IPersistentVector, allowing us to use access by index rather than using an iterator.

``````static boolean doEquiv(IPersistentVector v, Object obj){
if(obj instanceof IPersistentVector)
{
IPersistentVector ov = (IPersistentVector) obj;
if(ov.count() != v.count())
return false;
for(int i = 0;i< v.count();i++)
{
if(!Util.equiv(v.nth(i), ov.nth(i)))
return false;
}
return true;
}
else if(obj instanceof List)
{
Collection ma = (Collection) obj;

if((!(ma instanceof IPersistentCollection) || (ma instanceof Counted)) && (ma.size() != v.count()))
return false;

Iterator i2 = ma.iterator();

for(Iterator i1 = ((List) v).iterator(); i1.hasNext();)
{
if(!i2.hasNext() || !Util.equiv(i1.next(), i2.next()))
return false;
}
return !i2.hasNext();
}
else
{
if(!(obj instanceof Sequential))
return false;
ISeq ms = RT.seq(obj);
for(int i = 0; i < v.count(); i++, ms = ms.next())
{
if(ms == null || !Util.equiv(v.nth(i), ms.first()))
return false;
}
if(ms != null)
return false;
}

return true;

}
``````

There is one major clue in that code. The early escape clause in the `IList` section for `doEquals` is:

``````		if(ma.size() != v.count() || ma.hashCode() != v.hashCode())
return false;
``````

while in `doEquiv` we see:

``````		if((!(ma instanceof IPersistentCollection)
|| (ma instanceof Counted)) && (ma.size() != v.count()))
return false;
``````

A size comparison in both. But only `doEquals` looks at hashcode equality. And therein lies the historical tale. There was a need to improve the distribution of hash codes for collections. Because many Clojure collections implement `java.util.List`, to fit into the Java eco-system, they needed to follow the contract for hash codes for that interface:

`int hashCode()`

Returns the hash code value for this list. The hash code of a list is defined to be the result of the following calculation:

``````        hashCode = 1;
Iterator i = list.iterator();
while (i.hasNext()) {
Object obj = i.next();
hashCode = 31*hashCode + (obj==null ? 0 : obj.hashCode());
}
``````

Thus, `APersistentVector.hashCode()` at its core is:

``````hash = 1;
for(int i = 0;i<count();i++)
{
Object obj = nth(i);
hash = 31 * hash + (obj == null ? 0 : obj.hashCode());
}
this._hash = hash;
``````

The hashcode that will be used internally in Clojure, for things such as map key calculations, is `hasheq`. Its core is

``````int n;
hash = 1;

for(n=0;n<count();++n)
{
hash = 31 * hash + Util.hasheq(nth(n));
}

this._hasheq = hash = Murmur3.mixCollHash(hash, n);
``````

Same combining of element hashes, but some extra mixing using a Murmur3 method to improve hashcode distribution.

For maps, the case is similar. The Java spec for `java.util.Map.hashCode()` tells us:

Returns the hash code value for this map. The hash code of a map is defined to be the sum of the hashCodes of each entry in the map’s entrySet view. This ensures that `t1.equals(t2)` implies that `t1.hashCode()==t2.hashCode()` for any two maps `t1` and `t2`, as required by the general contract of `Object.hashCode`.

Thus, for `APersistentMap.hashCode()` in Clojure:

``````int hash = 0;
for(ISeq s = m.seq(); s != null; s = s.next())
{
Map.Entry e = (Map.Entry) s.first();
hash += (e.getKey() == null ? 0 : e.getKey().hashCode()) ^
(e.getValue() == null ? 0 : e.getValue().hashCode());
}
``````

while for `APersistentMap.hasheq()`, Murmur3 to the rescue again:

``````Murmur3.hashUnordered(m)
``````

There were other factors in the change from the older `Equals/hashCode` approach to using `equiv/hasheq`, most prominent being the need to deal with numbers more effectively and consistently. That is the topic of the next post.

I hope this explains some of the code we will see later on. The constraints on `Equals/hashCode` and `equiv/hasheq` need to be kept in mind going forward.

Note: When I first ported the Java code to C#, I did a lot of straight copying, then modifying the code to get rid of compiler errors. Not too much thinking required. When I began testing, I was getting failures on some of the equality tests on collections. Not surprising, given that hashcode check – totally different regimen for dealing with hashcodes in `System.Collections.*`. (As in there is no regimen.) So I had to remove the hashcode short-circuits.

When `equiv/hasheq` came into the picture, I just put in the changes. Although I needed to keep the `Equals`/`equiv` part the same, I’m guessing there was no need to keep around two different hash codes. Something else to consider as we make our way through the data structure rewrites.