This is the first data structure we’ve worked on where performance can reasonably be tested. How does the new F# version of the code compare to the original C#? Where can performance be improved?

I used BenchmarkDotNet to generate the comparisons below. It makes benchmarking fun.

I want to make sure I’m not making egregiously stupid errors in writing the new F# code, a concern given my lack of experience in the language. I also wonder about the performance differences of certain constructs in F#.

Transiency

The first test compares creation of PersistentVectors via transiency. That is probably the simplest code we can look at. I compared to the ClojureCLR C# code (notated as First) against the ClojureCLR.Next F# code (notated as Next for a variety of vector sizes. Sample benchmark code:

[<Benchmark>]
member this.NextTransientConj() =
    let mutable pv =
        (Clojure.Collections.PersistentVector.EMPTY :> Clojure.Collections.IEditableCollection)
            .asTransient ()

    for i in 0 .. this.size do
        pv <- pv.conj (i)

    pv.persistent ()

I’ll leave out some of the details from the BenchmarkDotNet output, but you’ll get the idea.

Method size Mean Ratio Allocated Alloc Ratio
FirstTransientConj 100 909.5 ns 1.00 4.95 KB 1.00
NextTransientConj 100 951.3 ns 1.05 5.36 KB 1.08
FirstTransientConj 1000 8,502.6 ns 1.00 43.14 KB 1.00
NextTransientConj 1000 8,191.6 ns 0.96 43.55 KB 1.01
FirstTransientConj 10000 110,798.0 ns 1.00 510.69 KB 1.00
NextTransientConj 10000 109,310.5 ns 0.99 511.09 KB 1.00
FirstTransientConj 100000 2,524,123.1 ns 1.00 5831.35 KB 1.00
NextTransientConj 100000 2,596,691.2 ns 1.03 5831.75 KB 1.00

I’ve seen a few percentage points swing between successive runs. We can call this a dead heat.

While we’re on transients, I did do a comparision of transient vs non-transient versions in the F# code only.

Method size Mean Ratio Allocated Alloc Ratio
NextTransientConj 10 238.9 ns 1.00 1.47 KB 1.00
NextCons 10 321.7 ns 1.35 2.04 KB 1.39
NextTransientConj 100 1,262.7 ns 1.00 5.36 KB 1.00
NextCons 100 3,348.6 ns 2.67 24.16 KB 4.51
NextTransientConj 1000 11,410.0 ns 1.00 43.55 KB 1.00
NextCons 1000 34,894.5 ns 3.08 240.71 KB 5.53
NextTransientConj 10000 143,966.6 ns 1.00 511.09 KB 1.00
NextCons 10000 572,406.2 ns 3.97 2494.45 KB 4.88
NextTransientConj 100000 3,120,521.6 ns 1.00 5831.75 KB 1.00
NextCons 100000 13,182,321.8 ns 4.21 25679.83 KB 4.40

No surprise that transients win. The benchmark on the transients reference page shows a similar ratio. (73.7 vs 19.7 = 3.74 ratio).

Cons

The situation is not quite as good for cons, but not enough to make one overly concerned. The benchmark is almost identical to the one for conj, just not using transients:

[<Benchmark>]
member this.NextCons() =
    let mutable pv =
        Clojure.Collections.PersistentVector.EMPTY :> Clojure.Collections.IPersistentVector

    for i in 0 .. this.size do
        pv <- pv.cons (i)

    pv
Method size Mean Ratio Allocated Alloc Ratio
FirstCons 100 2.409 us 1.00 23.76 KB 1.00
NextCons 100 2.594 us 1.07 24.16 KB 1.02
FirstCons 1000 24.471 us 1.00 240.3 KB 1.00
NextCons 1000 26.581 us 1.09 240.71 KB 1.00
FirstCons 10000 410.706 us 1.00 2494.05 KB 1.00
NextCons 10000 438.763 us 1.07 2494.45 KB 1.00
FirstCons 100000 12,322.450 us 1.00 25679.42 KB 1.00
NextCons 100000 12,389.234 us 1.01 25679.82 KB 1.00

I did come up with one hypothesis to explain some of the difference. If you look at the code for cons, the major work is done in methods pushTail and newPath. They are very simple. I looked at the IL generated from both the F# and the C# code. The IL is almost identical. With one exception. Both methods create little arrays and hook them together. The code to create an array in C#

object[] array = new object[_tail.length+1];

Once you have the size on the stack, the array allocation is just one IL instruction:

newarr [netstandard]System.Object

In the F# code we have

object[] newTail = ArrayModule.ZeroCreate<object>(tail.Length + 1);

We are told this is the preferred mechanizm for creating an array. That translates into

call !!0[] [FSharp.Core]Microsoft.FSharp.Collections.ArrayModule::ZeroCreate<object>(int32)

Does that make a difference? I looked at the assembler generated by the JIT (thanks to sharplab.io). I’m not an expert in reading the assembler generated for .NET code, but that call translates into a jump to a subroutine rather than a full-blown .NET method call. And you can find the source code for ArrayModule::ZeroCreate:

let zeroCreate count =
    if count < 0 then
        invalidArgInputMustBeNonNegative "count" count

    Microsoft.FSharp.Primitives.Basics.Array.zeroCreateUnchecked count

The code for zeroCreateUnchecked is set up to directly emit a newarr instruction when compiled.

So is the argument check plus the subroutine jump enough to make a difference over just executing newarr? Yep. I tested calls to a C# method that just calls new object[n] and an F# method that just calls Array.zeroCreate n. The results, please.

Method Mean Error StdDev Ratio RatioSD Gen0 Gen1 Allocated Alloc Ratio
CSharp 89.70 us 1.754 us 1.802 us 1.00 0.00 214.2334 0.1221 2.67 MB 1.00
FSharp 96.70 us 1.306 us 1.020 us 1.07 0.02 214.2334 0.1221 2.67 MB 1.00

That 7% difference would not account for all the difference in the cons benchmark, given all the other operations, but it is a factor. I have no idea how to get rid of this difference.

Nth

With nth the initial results were more concerning. Sample code:

[<Benchmark>]
member this.NextNth() =
    let pv = this.nextVec :> Clojure.Collections.Indexed
    let mutable acc: obj = null

    for i in 0 .. (this.size - 1) do
        acc <- pv.nth (i)

    acc

In this code, this.nextVec would be initialized during test setup to be a vector of the size being tested. The benchmark calls nth on every item in the vector. The results:

Method size Mean Ratio
FirstNth 10 34.32 ns 1.00
NextNth 10 31.59 ns 0.92
FirstNth 100 385.30 ns 1.00
NextNth 100 504.02 ns 1.31
FirstNth 1000 3,747.78 ns 1.00
NextNth 1000 4,966.74 ns 1.33
FirstNth 10000 51,141.67 ns 1.00
NextNth 10000 68,569.16 ns 1.34
FirstNth 100000 632,040.21 ns 1.00
NextNth 100000 747,017.44 ns 1.18

The F# code is faster for smaller vectors – that work is all done in the tail array, not in the index tree. As soon as we hit the index tree, performance declines dramatically.

I compared the code at both the source level and the IL level. The code for nth itself is essentially identical. The difference comes in arrayFor. The C# version:

object[] ArrayFor(int i) 
{
    if (i >= 0 && i < _cnt)
    {
        if (i >= tailoff())
            return _tail;
        Node node = _root;
        for (int level = _shift; level > 0; level -= 5)
            node = (Node)node.Array[(i >> level) & 0x01f];
        return node.Array;
    }
    throw new ArgumentOutOfRangeException("i");
}

The F# version:

member this.arrayFor(i) =
    if 0 <= i && i < cnt then
        if i >= this.tailoff () then
            tail
        else
            let rec loop (node: PVNode) level =
                if level <= 0 then
                    node.Array
                else
                    let newNode = node.Array[(i >>> level) &&& 0x1f] :?> PVNode
                    loop newNode (level - 5)
            loop root shift
    else
        raise <| ArgumentOutOfRangeException("i")

I used the standard trick of changing a loop with assignment in the body into a recursive looping function. Tail recursive code such as this is converted efficiently into a loop by the F# compiler. The code looks almost identical to the loop code in the C# version. With one difference. I loaded the F# IL into ILSpy and had it write it back out as C#:

public object[] arrayFor(int i)
{
	if (0 <= i && i < cnt)
	{
		if (i >= tailoff())
		{
			return tail;
		}
		return $PersistentVector.step@658-35(i, root, shift);
	}
	ArgumentOutOfRangeException ex = new ArgumentOutOfRangeException("i");
	throw ex;
}

The loop is off in that internal function. That loop is not inlined. We pay the overhead of a call. And that is enough.

I benchmarked loop vs recursive function code in F# when I first started writing code. I wanted to do the things stylishly. To be clear, some loops in F# can’t be written directly as loops in F# – breaking out of loops early is not supported. But some could be translated more directly by using mutable bindings. My benchmarking showed that generally the compiler generates very efficient looping code. If you look at $PersistentVector.step@658-35, that code is essentially the same as the C# loop code. However, the loop call usually cannot be inlined, so there is the overhead of calling it. That overhead is nominal provided either (a) the effort within each iteration is significant or (b) the number of iterations is very large. In this code, neither is true. Each iteration does almost nothing. And the number of iterations, even with large vector sizes, is very small – tree depth is not going be more than three. Under these circumstances, we feel the cost of the function call.

I did a benchmark just on the function call itself when the loop does almost nothing. It is consistent with the findings here.

The solution to our performance problem: just write a mutating loop in F#.

member this.arrayFor(i) =
    if 0 <= i && i < cnt then
        if i >= this.tailoff () then
            tail
        else
            let mutable node = root
            let mutable sh = shift

            while sh > 0 do
                node <- node.Array[(i >>> sh) &&& 0x1f] :?> PVNode
                sh <- sh-5
            node.Array
    else
        raise <| ArgumentOutOfRangeException("i")

The effect is notable:

Method size Mean Ratio
FirstNth 10 34.01 ns 1.00
NextNth 10 31.01 ns 0.91
FirstNth 100 385.56 ns 1.00
NextNth 100 396.11 ns 1.03
FirstNth 1000 3,763.27 ns 1.00
NextNth 1000 3,891.33 ns 1.03
FirstNth 10000 50,540.77 ns 1.00
NextNth 10000 56,087.60 ns 1.11
FirstNth 100000 628,617.81 ns 1.00
NextNth 100000 600,595.01 ns 0.95

One needs to be careful. Sometimes purity of construct will get you into trouble. But most of what I am doing so far is gaining much cleaner code (to my mind) with no significant performance hit. Onward. With caution.

End note

In the second post in this series, I promised a benchmark on the relative performance of three ways of computing the array index at given level in the tree.

    static member doMod(i) =
        let im1 = i - 1
        im1 - (im1 % 32)     

    static member doDiv(i) =
        (( i - 1) / 32) *  32
            
    static member doShift(i) =
        ((i-1) >>> 5) <<< 5

I did say the results were not going to be a surprise.

Method Mean Error StdDev Ratio RatioSD
ModOp 56.75 us 1.111 us 0.984 us 1.00 0.00
DivOp 48.54 us 0.751 us 0.703 us 0.86 0.02
BitShift 31.78 us 0.424 us 0.376 us 0.56 0.01

With that, time to leave the bench and head for the beach.