We look at performance of the new F# version of PersistentHashMap and compare it to the Clojure version. And in the end, declare ourselved mystified.

This is the final post in a series on PersistentHashMap. The previous posts are:

Performance

Given my level of F# experience, I’m always concerned about doing something stupid that will kill performance.
(See this serieds for an example. So I decided to do some performance testing.

I focused on several operations: assoc; assoc on a transient map; and containsKey. Construction one element at a time, efficient construction, and lookup. I coded some benchmarks. I ran them using ints for keys and using strings for keys – I wanted to make sure that the numeric comparisons were not a problem (see the link above).

Here, First is the current ClojureCLR code, Next is the F# code as described in these posts. With integer keys and various sizes of maps, the results for assoc were:

Method size Mean Ratio BranchInstructions/Op CacheMisses/Op BranchMispredictions/Op Allocated Alloc Ratio
FirstAssoc 32 4.025 us 1.00 10,315 108 45 15.28 KB 1.00
NextAssoc 32 5.637 us 1.40 14,817 142 66 15.28 KB 1.00
                 
FirstAssoc 100 13.240 us 1.00 31,093 382 205 53.45 KB 1.00
NextAssoc 100 16.556 us 1.27 43,083 209 262 53.45 KB 1.00
                 
FirstAssoc 500 90.823 us 1.00 196,351 2,112 2,161 332.38 KB 1.00
NextAssoc 500 106.330 us 1.19 256,438 1,515 2,435 332.38 KB 1.00
                 
FirstAssoc 1000 202.504 us 1.00 468,839 3,011 5,344 766.19 KB 1.00
NextAssoc 1000 239.588 us 1.20 593,876 3,469 5,925 766.19 KB 1.00

Not great. But not terrible. Note that memory allocation matches exactly. Things such as branch mispredictions are not too bad.

For transient assoc the results were:

Method size Mean Ratio BranchInstructions/Op BranchMispredictions/Op CacheMisses/Op Allocated Alloc Ratio
FirstTransientAssoc 32 3.062 us 1.00 9,494 30 51 6.07 KB 1.00
NextTransientAssoc 32 3.364 us 1.09 11,292 38 38 6.07 KB 1.00
                 
FirstTransientAssoc 50 3.809 us 1.00 12,838 37 44 7.76 KB 1.00
NextTransientAssoc 50 4.612 us 1.21 15,190 52 48 7.76 KB 1.00
                 
FirstTransientAssoc 100 6.618 us 1.00 22,187 79 70 12.51 KB 1.00
NextTransientAssoc 100 8.281 us 1.25 26,706 114 113 12.51 KB 1.00
                 
FirstTransientAssoc 500 43.971 us 1.00 126,296 916 529 69.75 KB 1.00
NextTransientAssoc 500 49.280 us 1.12 146,675 1,067 556 69.75 KB 1.00

A little better.

For lookups, I ran one benchmark where I looked up every key in the map. Then one where I looked up a key not in the map. The results were:

Method size Mean Ratio BranchInstructions/Op CacheMisses/Op BranchMispredictions/Op Allocated Alloc Ratio
FirstContainsKey 10 719.9 ns 1.00 2,256 12 5 960 B 1.00
NextContainsKey 10 1,263.1 ns 1.76 4,931 10 14 960 B 1.00
                 
FirstContainsKey 100 5,869.5 ns 1.00 19,794 67 32 9600 B 1.00
NextContainsKey 100 11,709.8 ns 1.99 46,725 94 111 9600 B 1.00
                 
FirstContainsKey 1000 68,987.0 ns 1.00 211,902 683 618 96001 B 1.00
NextContainsKey 1000 136,083.2 ns 1.97 479,782 1,154 1,278 96001 B 1.00
                 
FirstContainsKey 10000 817,806.8 ns 1.00 2,263,791 9,010 8,062 960012 B 1.00
NextContainsKey 10000 1,416,671.7 ns 1.73 4,959,326 12,955 12,030 960013 B 1.00
                 
FirstContainsKey 100000 11,913,599.8 ns 1.00 21,925,495 198,848 44,257 9600126 B 1.00
NextContainsKey 100000 19,679,065.8 ns 1.64 49,517,508 258,334 85,677 9600132 B 1.00
Method size Mean Ratio BranchInstructions/Op CacheMisses/Op BranchMispredictions/Op Allocated Alloc Ratio
FirstContainsKeyMissing 10 824.4 ns 1.00 2,752 12 4 960 B 1.00
NextContainsKeyMissing 10 988.0 ns 1.20 4,118 6 5 480 B 0.50
                 
FirstContainsKeyMissing 100 2,739.3 ns 1.00 9,576 41 10 9600 B 1.00
NextContainsKeyMissing 100 5,411.5 ns 1.98 21,266 42 18 4800 B 0.50
                 
FirstContainsKeyMissing 100000 3,035,531.8 ns 1.00 11,578,970 30,002 8,429 9600120 B 1.00
NextContainsKeyMissing 100000 10,824,464.3 ns 3.57 44,755,575 59,803 35,367 4800066 B 0.50

We have better memory allocation in the latter, but branch mispredictions and cache misses are much higher in both. And the performance ratio is really bad.

I ran the same tests for string keys. Similar results. Looking at just the time/op ratios:

Operation size int keys string keys
Assoc 32 1.40 1.12
Assoc 100 1.27 1.20
Assoc 500 1.19 1.17
Assoc 1000 1.20 1.19
       
TAssoc 32 1.09 1.10
TAssoc 50 1.21 1.20
TAssoc 100 1.25 1.16
TAssoc 500 1.12 1.11
       
CKey 10 1.76 1.83
CKey 100 1.99 1.57
CKey 1000 1.97 1.42
CKey 10000 1.73 1.45
CKey 100000 1.64 1.37
       
Miss 10 1.20 2.97
Miss 100 1.98 3.05
Miss 100000 3.57 2.45

Lookups

Looking for a key not in a map is horrendously less efficient. Now, if you look at the actual timings, we are talking about 17 ns vs 51 ns or 25ns vs 60 ns, but a few nanoseconds here and there and soon you are talking hours.

I decided to focus on containsKey. The code involved is much simpler than that for assoc.

In a first walkthrough, I found two simple coding errors. One involved our old friend = translating to complicated generic equality check. (As covered in the post mentioned above.) This one I missed because it was actually <>. Same difference. Oops. Another involved a hash function that was generic and not specialized for obj. I fixed these and reran just the lookup benchmarks on string keys. There was more improvement than I would have guessed:

Operation size int keys string keys after fixes
CKey 10 1.76 1.83 1.11
CKey 100 1.99 1.57 1.20
CKey 1000 1.97 1.42 1.09
CKey 10000 1.73 1.45 1.07
CKey 100000 1.64 1.37 1.03
         
Miss 10 1.20 2.97 1.88
Miss 100 1.98 3.05 1.97
Miss 100000 3.57 2.45 1.87

Nice improvement. But the ‘Miss’ case is still troublesome.

I went in for a comparison of the underlying code, function by function. The new F#, the F# translated to IL and decompiled to C# (thanks, ILSpy) and the original C#.

To give you the gist, here is PersistentMap.containsKey:

The F# code:

        override _.containsKey(k) =
            if isNull k then
                hasNull
            else
                (not (isNull root))
                &&  not <| LanguagePrimitives.PhysicalEquality  (root.find (0, NodeOps.hash (k), k, PersistentHashMap.notFoundValue)) PersistentHashMap.notFoundValue

The F# translated to C#:

virtual bool Associative.containsKey(object k)
{
	if (k == null)
	{
		return hasNull;
	}
	if (root != null && 0 == 0)
	{
		bool flag = root.find(0, NodeOps.hash(k), k, notFoundValue) == notFoundValue;
		return !flag;
	}
	return false;
}

The original C#:

public override bool containsKey(object key)
{
	if (key == null)
	{
		return _hasNull;
	}
	return _root != null && _root.Find(0, Hash(key), key, NotFoundValue) != NotFoundValue;
}

I don’t think we can get closer than that.

I looked at the hashing code. I looked at the implementations of find in each of the three node types. Because some branches down the call tree seemed to go to far (hashing) I ran some direct benchmarks.

For example, given that we had already worked out numeric comparison issues previously, I concentrated on string keys. We need to look at hashing and equality. String hashing takes one to the method HashStringU in the Murmur3 library. I ran a benchmark just for that:

Method size Mean Ratio
FirstHashStringU 100 3.254 us 1.00
NextHashStringU 100 3.117 us 0.96

When I looked at the code for hashing in general, the thing that calls HashStringU. Here is the F# translated to C#:

public static int hasheq(object o)
{
	if (o != null)
	{
		if (!(o is IHashEq))
		{
			object x2;
			if (o is string)
			{
				object x3 = o;
				if (!Numbers.IsNumeric(x3))
				{
					string s = (string)o;
					return Murmur3.HashString(s);
				}
				x2 = o;
			}
			else
			{
				object x = o;
				if (!Numbers.IsNumeric(x))
				{
					return o.GetHashCode();
				}
				x2 = o;
			}
			return Numbers.hasheq(x2);
		}
		IHashEq ihe = (IHashEq)o;
		return ihe.hasheq();
	}
	return 0;
}

The original C#:

public static int hasheq(object o)
{
	if (o == null)
	{
		return 0;
	}
	if (o is IHashEq ihe)
	{
		return dohasheq(ihe);
	}
	if (IsNumeric(o))
	{
		return Numbers.hasheq(o);
	}
	if (o is string s)
	{
		return Murmur3.HashString(s);
	}
	return o.GetHashCode();
}

I’m not great on these architectural issues, but I do know that branching misprediction are a thing. How are we doing? I coded up a benchmark to hash keys of various types. (Many times.)

Method size Mean Error StdDev Ratio RatioSD Gen0 BranchInstructions/Op CacheMisses/Op BranchMispredictions/Op Allocated Alloc Ratio
FirstHasheq 1000 35.61 us 0.708 us 0.695 us 1.00 0.00 1.8311 105,506 321 96 23.44 KB 1.00
NextHasheq 1000 34.68 us 0.199 us 0.186 us 0.97 0.02 1.3428 114,459 179 123 17.58 KB 0.75

Well, we are as performant and allocate less. Let’s move on.

I went down in the lookup code in each node type. We have to compare keys against keys. That is an equiv method. Might we have problems? I tested comparing keys of various types agaisnt each other. (many times)

Method size Mean Error StdDev Ratio RatioSD BranchInstructions/Op BranchMispredictions/Op CacheMisses/Op Allocated Alloc Ratio
FirstEquiv 12 3.252 us 0.0639 us 0.1417 us 1.00 0.00 11,900 30 2 - NA
NextEquiv 12 2.350 us 0.0288 us 0.0270 us 0.73 0.04 10,328 44 1 - NA

Actually, the F# version does better.

What’s next to look at?

I don’t know.

I went through every line of code that is hit doing containsKey. Or I wrote benchmarks for the smaller things that had too many moving parts to look at. The F# code translated to C# is almost identical throughout.

As a last resort, I considered the possibility that the tree structure itself was the problem. Maybe the F# code was creating a deep tree structure due to some bug I missed. This is unlikely, given that the algorithm has the exact same memory allocation as the C# version. Nevertheless, I wrote printer for PersistentHashMap in both the F# and C# versions. I ran it on maps of 10, 100, and 500 elements. The trees produced by the two versions were identical.

I don’t know what else to look at.

I’m open to suggestions.