Performance of the different ways to swap two values

Page content

Swapping values is probably one of the simplest algorithms which can be imagined - we learn about it when starting our programming story. There are two popular ways to accomplish this: using a temporary variable and XORing (with some restrictions). In the newest C# versions there is also a third way, about which you can read in this article.

Theory

A few weeks ago, one of my friends sent me this tweet from which we will start:

What do we see here is the usage of the value tuples syntax (introduced in the C# 7) to swap two values. As the description says, first we create a tuple with our two variables (on the right side of the assignment) and then it’s instantly deconstructed into the second tuple (on the left side of the assignment) but in reversed order. It was quite surprising as I never thought about using tuples this way, but at the same time, I was curious if this solution has some performance overhead.

On the one side, we use here ValueTuple struct, so it’s not an overhead for garbage collector - values are stored on the stack and will be just wiped after exiting from the method. On the other hand, we don’t know if the compiler will generate some extra IL instructions compared to the traditional swap using a temporary variable. Let’s check it on the simple example.

Benchmark

As in the previous articles, I used BenchmarkDotNet library which is extremally handful when measuring performance and checking output generated by the JIT. Our benchmark is very simple and swaps two variables utilizing three different methods: through temporary variable, through XOR operations, and tuples (like on the tweet above). Additional assignments on the beginning and the end of each test are made to ensure, that compiler/JIT will make these operations using registers to maximize performance.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

namespace SwapBenchmark
{
    [DisassemblyDiagnoser]
    [SimpleJob(RuntimeMoniker.NetCoreApp31)]
    public class SwapTest
    {
        private static volatile int A = 10;
        private static volatile int B = 20;

        [Benchmark]
        public void SwapTraditional()
        {
            var a = A;
            var b = B;

            var temp = a;
            a = b;
            b = temp;

            A = a;
            B = b;
        }

        [Benchmark]
        public void SwapXor()
        {
            var a = A;
            var b = B;

            a ^= b;
            b ^= a;
            a ^= b;

            A = a;
            B = b;
        }

        [Benchmark]
        public void SwapTuples()
        {
            var a = A;
            var b = B;

            (a, b) = (b, a);

            A = a;
            B = b;
        }
    }

    class Program
    {
        public static void Main(string[] args)
        {
            BenchmarkRunner.Run<SwapTest>();
        }
    }
}

Let’s check the IL output first. Our first method SwapTraditional is straightforward and in fact uses only four instructions:

  • ldsfld - loads static field onto the stack.
  • stsfld - pops variable from the stack and stores it into the static field.
  • ldloc.x - loads local variable and stores it onto the stack at the specified index (x).
  • stloc.x - pops variable from the stack and stores it into the local variables list at the specified index (x).
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
  .method public hidebysig instance void
    SwapTraditional() cil managed
  {
    .custom instance void [BenchmarkDotNet.Annotations]BenchmarkDotNet.Attributes.BenchmarkAttribute::.ctor()
      = (01 00 00 00 )
    .maxstack 2
    .locals init (
      [0] int32 a,
      [1] int32 temp
    )

    // [17 13 - 17 23]
    IL_0000: volatile.
    IL_0002: ldsfld       int32 modreq ([System.Runtime]System.Runtime.CompilerServices.IsVolatile) SwapBenchmark.SwapTest::A
    IL_0007: stloc.0      // a

    // [18 13 - 18 23]
    IL_0008: volatile.
    IL_000a: ldsfld       int32 modreq ([System.Runtime]System.Runtime.CompilerServices.IsVolatile) SwapBenchmark.SwapTest::B

    // [20 13 - 20 26]
    IL_000f: ldloc.0      // a
    IL_0010: stloc.1      // temp

    // [21 13 - 21 19]
    IL_0011: stloc.0      // a

    // [22 13 - 22 22]
    IL_0012: ldloc.1      // temp

    // [24 13 - 24 19]
    IL_0013: ldloc.0      // a
    IL_0014: volatile.
    IL_0016: stsfld       int32 modreq ([System.Runtime]System.Runtime.CompilerServices.IsVolatile) SwapBenchmark.SwapTest::A

    // [25 13 - 25 19]
    IL_001b: volatile.
    IL_001d: stsfld       int32 modreq ([System.Runtime]System.Runtime.CompilerServices.IsVolatile) SwapBenchmark.SwapTest::B

    // [26 9 - 26 10]
    IL_0022: ret

  } // end of method SwapTest::SwapTraditional

We can see the sequence of the following operations: first, the compiler loads static field A onto the stack and then pops it into the local variable a. Static field B is also loaded in a similar way, but it stays in the stack. Then, the local variable a is loaded onto the stack and immediately popped into the temp. Just after it, the value of the static field B (which has been loaded earlier) is assigned to the a. At this point, the local variable a contains value of the static field B, and the local variable temp contains the value of the a. The last operations are loading both of them onto the stack and popping into A and B.

Now time for the second test, which uses XOR operations. The compiler uses here the new instruction xor which does exactly what the name indicates - pops two variables from the stack, XORs them and stores the result again on the stack.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
  .method public hidebysig instance void
    SwapXor() cil managed
  {
    .custom instance void [BenchmarkDotNet.Annotations]BenchmarkDotNet.Attributes.BenchmarkAttribute::.ctor()
      = (01 00 00 00 )
    .maxstack 2
    .locals init (
      [0] int32 a,
      [1] int32 b
    )

    // [31 13 - 31 23]
    IL_0000: volatile.
    IL_0002: ldsfld       int32 modreq ([System.Runtime]System.Runtime.CompilerServices.IsVolatile) SwapBenchmark.SwapTest::A
    IL_0007: stloc.0      // a

    // [32 13 - 32 23]
    IL_0008: volatile.
    IL_000a: ldsfld       int32 modreq ([System.Runtime]System.Runtime.CompilerServices.IsVolatile) SwapBenchmark.SwapTest::B
    IL_000f: stloc.1      // b

    // [34 13 - 34 20]
    IL_0010: ldloc.0      // a
    IL_0011: ldloc.1      // b
    IL_0012: xor
    IL_0013: stloc.0      // a

    // [35 13 - 35 20]
    IL_0014: ldloc.1      // b
    IL_0015: ldloc.0      // a
    IL_0016: xor
    IL_0017: stloc.1      // b

    // [36 13 - 36 20]
    IL_0018: ldloc.0      // a
    IL_0019: ldloc.1      // b
    IL_001a: xor
    IL_001b: stloc.0      // a

    // [38 13 - 38 19]
    IL_001c: ldloc.0      // a
    IL_001d: volatile.
    IL_001f: stsfld       int32 modreq ([System.Runtime]System.Runtime.CompilerServices.IsVolatile) SwapBenchmark.SwapTest::A

    // [39 13 - 39 19]
    IL_0024: ldloc.1      // b
    IL_0025: volatile.
    IL_0027: stsfld       int32 modreq ([System.Runtime]System.Runtime.CompilerServices.IsVolatile) SwapBenchmark.SwapTest::B

    // [40 9 - 40 10]
    IL_002c: ret

  } // end of method SwapTest::SwapXor

The thrid test is finally our tuple-based swap. Surprisingly, it’s quite similar to the first one (where the temporary variable was used). First, both static fields A and B are loaded into the local variables a and b, then we have a series of ldloc and stloc instructions which also use a temporary variable called V_2 to swap values.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
  .method public hidebysig instance void
    SwapTuples() cil managed
  {
    .custom instance void [BenchmarkDotNet.Annotations]BenchmarkDotNet.Attributes.BenchmarkAttribute::.ctor()
      = (01 00 00 00 )
    .maxstack 2
    .locals init (
      [0] int32 a,
      [1] int32 b,
      [2] int32 V_2
    )

    // [45 13 - 45 23]
    IL_0000: volatile.
    IL_0002: ldsfld       int32 modreq ([System.Runtime]System.Runtime.CompilerServices.IsVolatile) SwapBenchmark.SwapTest::A
    IL_0007: stloc.0      // a

    // [46 13 - 46 23]
    IL_0008: volatile.
    IL_000a: ldsfld       int32 modreq ([System.Runtime]System.Runtime.CompilerServices.IsVolatile) SwapBenchmark.SwapTest::B
    IL_000f: stloc.1      // b

    // [48 13 - 48 29]
    IL_0010: ldloc.1      // b
    IL_0011: ldloc.0      // a
    IL_0012: stloc.2      // V_2
    IL_0013: stloc.0      // a
    IL_0014: ldloc.2      // V_2
    IL_0015: stloc.1      // b

    // [50 13 - 50 19]
    IL_0016: ldloc.0      // a
    IL_0017: volatile.
    IL_0019: stsfld       int32 modreq ([System.Runtime]System.Runtime.CompilerServices.IsVolatile) SwapBenchmark.SwapTest::A

    // [51 13 - 51 19]
    IL_001e: ldloc.1      // b
    IL_001f: volatile.
    IL_0021: stsfld       int32 modreq ([System.Runtime]System.Runtime.CompilerServices.IsVolatile) SwapBenchmark.SwapTest::B

    // [52 9 - 52 10]
    IL_0026: ret

  } // end of method SwapTest::SwapTuples

In theory, this method is a bit slower because utilizes more IL instructions than in the first example (5 ldloc/5 stloc vs 3 ldloc/3 stloc). Will this affect the final output generated by the JIT? The answer is: no!

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
; SwapBenchmark.SwapTest.SwapTraditional()
    mov       eax,[7FFC4CC5B7AC]
    mov       edx,[7FFC4CC5B7B0]
    mov       [7FFC4CC5B7AC],edx
    mov       [7FFC4CC5B7B0],eax
    ret

; SwapBenchmark.SwapTest.SwapXor()
    mov       eax,[7FFC4CC2B7AC]
    mov       edx,[7FFC4CC2B7B0]
    xor       eax,edx
    xor       edx,eax
    xor       eax,edx
    mov       [7FFC4CC2B7AC],eax
    mov       [7FFC4CC2B7B0],edx
    ret

; SwapBenchmark.SwapTest.SwapTuples()
    mov       eax,[7FFC4CC4B7AC]
    mov       edx,[7FFC4CC4B7B0]
    mov       [7FFC4CC4B7AC],edx
    mov       [7FFC4CC4B7B0],eax
    ret

Both first and third tests have exactly the same outputs - we can see that they have been optimized to use fast registers instead of using a slower stack. So the most important conclusion is: using tuples to swap values is exactly as fast as using a temporary variable.

The method which uses XOR operations is a bit more complex - three additional xor instructions won’t speed up the whole process for sure. In the next chapter, we will do a more life example with a real algorithm to check the times.

Bubble sort

One of the best algorithms to test the execution time of swaps is a bubble sort, so we will use it to check the performance of our three methods. In this benchmark, there are three tests where each performs a sort for the array with 10000 elements (initialized with values from 10000 to 1 to get the most pessimistic case).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

namespace BubbleSortSwapBenchmark
{
    [DisassemblyDiagnoser]
    [SimpleJob(RuntimeMoniker.NetCoreApp31)]
    public class BubbleSortSwapTest
    {
        private const int N = 10000;

        [Benchmark]
        public void BubbleSortSwapTraditional()
        {
            var array = Enumerable.Range(0, N).Reverse().ToArray();
            var sorted = false;

            while (!sorted)
            {
                sorted = true;

                for (var i = 0; i < array.Length - 1; i++)
                {
                    if (array[i] > array[i + 1])
                    {
                        var temp = array[i];
                        array[i] = array[i + 1];
                        array[i + 1] = temp;

                        sorted = false;
                    }
                }
            }
        }

        [Benchmark]
        public void BubbleSortSwapXor()
        {
            var array = Enumerable.Range(0, N).Reverse().ToArray();
            var sorted = false;

            while (!sorted)
            {
                sorted = true;

                for (var i = 0; i < array.Length - 1; i++)
                {
                    if (array[i] > array[i + 1])
                    {
                        array[i] ^= array[i + 1];
                        array[i + 1] ^= array[i];
                        array[i] ^= array[i + 1];

                        sorted = false;
                    }
                }
            }
        }

        [Benchmark]
        public void BubbleSortSwapTuples()
        {
            var array = Enumerable.Range(0, N).Reverse().ToArray();
            var sorted = false;

            while (!sorted)
            {
                sorted = true;

                for (var i = 0; i < array.Length - 1; i++)
                {
                    if (array[i] > array[i + 1])
                    {
                        (array[i], array[i + 1]) = (array[i + 1], array[i]);
                        sorted = false;
                    }
                }
            }
        }
    }

    class Program
    {
        public static void Main(string[] args)
        {
            BenchmarkRunner.Run<BubbleSortSwapTest>();
        }
    }
}
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.959 (1909/November2018Update/19H2)
Intel Core i5-8300H CPU 2.30GHz (Coffee Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
  [Host]        : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), X64 RyuJIT
  .NET Core 3.1 : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), X64 RyuJIT

Job=.NET Core 3.1  Runtime=.NET Core 3.1
MethodMeanErrorStdDevCode Size
BubbleSortSwapTraditional91.88 ms0.355 ms0.315 ms522 B
BubbleSortSwapXor133.82 ms0.668 ms0.624 ms540 B
BubbleSortSwapTuples91.33 ms0.359 ms0.336 ms528 B

The results are clear and valid with our earlier observations: both traditional and tuple-based swaps have exactly the same performance (differences in the mean time are within the error margin).

Summary

Sometimes, it’s not easy to tell if some solution has performance overhead, especially when we think about JIT with its complex optimization processes. As we saw in this article, a tuple-based algorithm is an excellent way to swap two elements, providing better readability and the same performance as the traditional algorithm using a temporary variable. Definitely worth to try and use in our projects.