We look at performance of the new F# version of PersistentHashMap and compare it to the Clojure version. And in the end, declare ourselved mystified.

This is the final post in a series on PersistentHashMap. The previous posts are:

Performance

Given my level of F# experience, I’m always concerned about doing something stupid that will kill performance.
(See this serieds for an example. So I decided to do some performance testing.

I focused on several operations: assoc; assoc on a transient map; and containsKey. Construction one element at a time, efficient construction, and lookup. I coded some benchmarks. I ran them using ints for keys and using strings for keys – I wanted to make sure that the numeric comparisons were not a problem (see the link above).

Here, First is the current ClojureCLR code, Next is the F# code as described in these posts. With integer keys and various sizes of maps, the results for assoc were:

Method	size	Mean	Ratio	BranchInstructions/Op	CacheMisses/Op	BranchMispredictions/Op	Allocated	Alloc Ratio
FirstAssoc	32	4.025 us	1.00	10,315	108	45	15.28 KB	1.00
NextAssoc	32	5.637 us	1.40	14,817	142	66	15.28 KB	1.00

FirstAssoc	100	13.240 us	1.00	31,093	382	205	53.45 KB	1.00
NextAssoc	100	16.556 us	1.27	43,083	209	262	53.45 KB	1.00

FirstAssoc	500	90.823 us	1.00	196,351	2,112	2,161	332.38 KB	1.00
NextAssoc	500	106.330 us	1.19	256,438	1,515	2,435	332.38 KB	1.00

FirstAssoc	1000	202.504 us	1.00	468,839	3,011	5,344	766.19 KB	1.00
NextAssoc	1000	239.588 us	1.20	593,876	3,469	5,925	766.19 KB	1.00

Not great. But not terrible. Note that memory allocation matches exactly. Things such as branch mispredictions are not too bad.

For transient assoc the results were:

Method	size	Mean	Ratio	BranchInstructions/Op	BranchMispredictions/Op	CacheMisses/Op	Allocated	Alloc Ratio
FirstTransientAssoc	32	3.062 us	1.00	9,494	30	51	6.07 KB	1.00
NextTransientAssoc	32	3.364 us	1.09	11,292	38	38	6.07 KB	1.00

FirstTransientAssoc	50	3.809 us	1.00	12,838	37	44	7.76 KB	1.00
NextTransientAssoc	50	4.612 us	1.21	15,190	52	48	7.76 KB	1.00

FirstTransientAssoc	100	6.618 us	1.00	22,187	79	70	12.51 KB	1.00
NextTransientAssoc	100	8.281 us	1.25	26,706	114	113	12.51 KB	1.00

FirstTransientAssoc	500	43.971 us	1.00	126,296	916	529	69.75 KB	1.00
NextTransientAssoc	500	49.280 us	1.12	146,675	1,067	556	69.75 KB	1.00

A little better.

For lookups, I ran one benchmark where I looked up every key in the map. Then one where I looked up a key not in the map. The results were:

Method	size	Mean	Ratio	BranchInstructions/Op	CacheMisses/Op	BranchMispredictions/Op	Allocated	Alloc Ratio
FirstContainsKey	10	719.9 ns	1.00	2,256	12	5	960 B	1.00
NextContainsKey	10	1,263.1 ns	1.76	4,931	10	14	960 B	1.00

FirstContainsKey	100	5,869.5 ns	1.00	19,794	67	32	9600 B	1.00
NextContainsKey	100	11,709.8 ns	1.99	46,725	94	111	9600 B	1.00

FirstContainsKey	1000	68,987.0 ns	1.00	211,902	683	618	96001 B	1.00
NextContainsKey	1000	136,083.2 ns	1.97	479,782	1,154	1,278	96001 B	1.00

FirstContainsKey	10000	817,806.8 ns	1.00	2,263,791	9,010	8,062	960012 B	1.00
NextContainsKey	10000	1,416,671.7 ns	1.73	4,959,326	12,955	12,030	960013 B	1.00

FirstContainsKey	100000	11,913,599.8 ns	1.00	21,925,495	198,848	44,257	9600126 B	1.00
NextContainsKey	100000	19,679,065.8 ns	1.64	49,517,508	258,334	85,677	9600132 B	1.00

Method	size	Mean	Ratio	BranchInstructions/Op	CacheMisses/Op	BranchMispredictions/Op	Allocated	Alloc Ratio
FirstContainsKeyMissing	10	824.4 ns	1.00	2,752	12	4	960 B	1.00
NextContainsKeyMissing	10	988.0 ns	1.20	4,118	6	5	480 B	0.50

FirstContainsKeyMissing	100	2,739.3 ns	1.00	9,576	41	10	9600 B	1.00
NextContainsKeyMissing	100	5,411.5 ns	1.98	21,266	42	18	4800 B	0.50

FirstContainsKeyMissing	100000	3,035,531.8 ns	1.00	11,578,970	30,002	8,429	9600120 B	1.00
NextContainsKeyMissing	100000	10,824,464.3 ns	3.57	44,755,575	59,803	35,367	4800066 B	0.50

We have better memory allocation in the latter, but branch mispredictions and cache misses are much higher in both. And the performance ratio is really bad.

I ran the same tests for string keys. Similar results. Looking at just the time/op ratios:

Operation	size	int keys	string keys
Assoc	32	1.40	1.12
Assoc	100	1.27	1.20
Assoc	500	1.19	1.17
Assoc	1000	1.20	1.19

TAssoc	32	1.09	1.10
TAssoc	50	1.21	1.20
TAssoc	100	1.25	1.16
TAssoc	500	1.12	1.11

CKey	10	1.76	1.83
CKey	100	1.99	1.57
CKey	1000	1.97	1.42
CKey	10000	1.73	1.45
CKey	100000	1.64	1.37

Miss	10	1.20	2.97
Miss	100	1.98	3.05
Miss	100000	3.57	2.45

Lookups

Looking for a key not in a map is horrendously less efficient. Now, if you look at the actual timings, we are talking about 17 ns vs 51 ns or 25ns vs 60 ns, but a few nanoseconds here and there and soon you are talking hours.

I decided to focus on containsKey. The code involved is much simpler than that for assoc.

In a first walkthrough, I found two simple coding errors. One involved our old friend = translating to complicated generic equality check. (As covered in the post mentioned above.) This one I missed because it was actually <>. Same difference. Oops. Another involved a hash function that was generic and not specialized for obj. I fixed these and reran just the lookup benchmarks on string keys. There was more improvement than I would have guessed:

Operation	size	int keys	string keys	after fixes
CKey	10	1.76	1.83	1.11
CKey	100	1.99	1.57	1.20
CKey	1000	1.97	1.42	1.09
CKey	10000	1.73	1.45	1.07
CKey	100000	1.64	1.37	1.03

Miss	10	1.20	2.97	1.88
Miss	100	1.98	3.05	1.97
Miss	100000	3.57	2.45	1.87

Nice improvement. But the ‘Miss’ case is still troublesome.

I went in for a comparison of the underlying code, function by function. The new F#, the F# translated to IL and decompiled to C# (thanks, ILSpy) and the original C#.

To give you the gist, here is PersistentMap.containsKey:

The F# code:

        override _.containsKey(k) =
            if isNull k then
                hasNull
            else
                (not (isNull root))
                &&  not <| LanguagePrimitives.PhysicalEquality  (root.find (0, NodeOps.hash (k), k, PersistentHashMap.notFoundValue)) PersistentHashMap.notFoundValue

The F# translated to C#:

virtual bool Associative.containsKey(object k)
{
	if (k == null)
	{
		return hasNull;
	}
	if (root != null && 0 == 0)
	{
		bool flag = root.find(0, NodeOps.hash(k), k, notFoundValue) == notFoundValue;
		return !flag;
	}
	return false;
}

The original C#:

public override bool containsKey(object key)
{
	if (key == null)
	{
		return _hasNull;
	}
	return _root != null && _root.Find(0, Hash(key), key, NotFoundValue) != NotFoundValue;
}

I don’t think we can get closer than that.

I looked at the hashing code. I looked at the implementations of find in each of the three node types. Because some branches down the call tree seemed to go to far (hashing) I ran some direct benchmarks.

For example, given that we had already worked out numeric comparison issues previously, I concentrated on string keys. We need to look at hashing and equality. String hashing takes one to the method HashStringU in the Murmur3 library. I ran a benchmark just for that:

Method	size	Mean	Ratio
FirstHashStringU	100	3.254 us	1.00
NextHashStringU	100	3.117 us	0.96

When I looked at the code for hashing in general, the thing that calls HashStringU. Here is the F# translated to C#:

public static int hasheq(object o)
{
	if (o != null)
	{
		if (!(o is IHashEq))
		{
			object x2;
			if (o is string)
			{
				object x3 = o;
				if (!Numbers.IsNumeric(x3))
				{
					string s = (string)o;
					return Murmur3.HashString(s);
				}
				x2 = o;
			}
			else
			{
				object x = o;
				if (!Numbers.IsNumeric(x))
				{
					return o.GetHashCode();
				}
				x2 = o;
			}
			return Numbers.hasheq(x2);
		}
		IHashEq ihe = (IHashEq)o;
		return ihe.hasheq();
	}
	return 0;
}

The original C#:

public static int hasheq(object o)
{
	if (o == null)
	{
		return 0;
	}
	if (o is IHashEq ihe)
	{
		return dohasheq(ihe);
	}
	if (IsNumeric(o))
	{
		return Numbers.hasheq(o);
	}
	if (o is string s)
	{
		return Murmur3.HashString(s);
	}
	return o.GetHashCode();
}

I’m not great on these architectural issues, but I do know that branching misprediction are a thing. How are we doing? I coded up a benchmark to hash keys of various types. (Many times.)

Method	size	Mean	Error	StdDev	Ratio	RatioSD	Gen0	BranchInstructions/Op	CacheMisses/Op	BranchMispredictions/Op	Allocated	Alloc Ratio
FirstHasheq	1000	35.61 us	0.708 us	0.695 us	1.00	0.00	1.8311	105,506	321	96	23.44 KB	1.00
NextHasheq	1000	34.68 us	0.199 us	0.186 us	0.97	0.02	1.3428	114,459	179	123	17.58 KB	0.75

Well, we are as performant and allocate less. Let’s move on.

I went down in the lookup code in each node type. We have to compare keys against keys. That is an equiv method. Might we have problems? I tested comparing keys of various types agaisnt each other. (many times)

Method	size	Mean	Error	StdDev	Ratio	RatioSD	BranchInstructions/Op	BranchMispredictions/Op	CacheMisses/Op	Allocated	Alloc Ratio
FirstEquiv	12	3.252 us	0.0639 us	0.1417 us	1.00	0.00	11,900	30	2	-	NA
NextEquiv	12	2.350 us	0.0288 us	0.0270 us	0.73	0.04	10,328	44	1	-	NA

Actually, the F# version does better.

What’s next to look at?

I don’t know.

I went through every line of code that is hit doing containsKey. Or I wrote benchmarks for the smaller things that had too many moving parts to look at. The F# code translated to C# is almost identical throughout.

As a last resort, I considered the possibility that the tree structure itself was the problem. Maybe the F# code was creating a deep tree structure due to some bug I missed. This is unlikely, given that the algorithm has the exact same memory allocation as the C# version. Nevertheless, I wrote printer for PersistentHashMap in both the F# and C# versions. I ran it on maps of 10, 100, and 500 elements. The trees produced by the two versions were identical.

I don’t know what else to look at.

I’m open to suggestions.