Inlining of intrinsic functions in .NET 5

For the last few months, I have been focused on developing the Cosette chess engine, experimenting with a wide range of algorithms and language-level optimizations. The latter were especially interesting because of the .NET 5 release, which brings a few notable speed improvements (I will try to write about them in the context of the engine soon). In this article, I will show a curious case in which a method wrapping an intrinsic function was not inlined, which led to a performance drop in the chess engine.

Mysterious inlining

Let’s say we want to write a method that is basically a wrapper for some intrinsic function, for example ResetLowestSetBit, which takes a number and returns it with the lowest set bit cleared to zero. It may look like this:

namespace PopLsbTest
{
    class Program
    {
        static ulong x;
        static void Main()
        {
            x = PopLsb1(100);
        }
        
        static ulong PopLsb1(ulong value)
        {
            return System.Runtime.Intrinsics.X86.Bmi1.X64.ResetLowestSetBit(value);
        }
    }
}

We made x a static field (instead of a simple local variable inside Main) to prevent the compiler from removing our PopLsb1 call. Because this method is so short, we would expect the JIT to inline it and emit the corresponding instruction (blsr in this case) directly in Main - and that’s exactly what happens:

; Core CLR v5.0.220.61120 on amd64

PopLsbTest.Program..ctor()
    L0000: ret

PopLsbTest.Program.Main()
    L0000: sub rsp, 0x28
    L0004: mov rcx, 0x7ff91bc8cb68
    L000e: xor edx, edx
    L0010: call 0x00007ff9731fa420
    L0015: mov edx, 0x64
    L001a: blsr rdx, rdx
    L001f: mov [rax+8], rdx
    L0023: add rsp, 0x28
    L0027: ret

PopLsbTest.Program.PopLsb1(UInt64)
    L0000: blsr rax, rcx
    L0005: ret

Although I didn’t find out what the first call refers to, we can safely assume it is something internal generated by the JIT (most likely a runtime helper related to static field access, since the address it returns in rax is then used to store x), so it’s not going to be interesting for us. The more important instruction is our blsr, which has been inlined by the JIT, so the program won’t pay any call overhead.

Now let’s assume that our program may run on processors that don’t support the BMI1 instruction set, to which blsr belongs. The software emulation of this operation is x & (x - 1), so let’s add a simple condition.

namespace PopLsbTest
{
    class Program
    {
        static ulong x;
        static void Main()
        {
            x = PopLsb2(100);
        }
        
        static ulong PopLsb2(ulong value)
        {
            // Check the X64 nested class, since that is the one whose method is called below.
            if (System.Runtime.Intrinsics.X86.Bmi1.X64.IsSupported)
            {
                return System.Runtime.Intrinsics.X86.Bmi1.X64.ResetLowestSetBit(value);
            }
            else
            {
                return value & (value - 1);
            }
        }
    }
}
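To see why value & (value - 1) works as a fallback, it may help to look at it in isolation. Subtracting 1 turns the lowest set bit into 0 and every zero below it into 1, so ANDing the result with the original value clears exactly that one bit. The following is a minimal sketch; the FallbackDemo class and the ResetLowestSetBitSoftware name are made up for illustration and are not part of the engine.

```csharp
static class FallbackDemo
{
    // Hypothetical helper: clears the lowest set bit without using BMI1.
    // Example: 100 = 0b1100100, 99 = 0b1100011, 100 & 99 = 0b1100000 = 96.
    public static ulong ResetLowestSetBitSoftware(ulong value)
    {
        // unchecked makes the wrap-around for value == 0 explicit:
        // 0 - 1 becomes ulong.MaxValue, and 0 & ulong.MaxValue is still 0.
        return value & unchecked(value - 1);
    }
}
```

Calling FallbackDemo.ResetLowestSetBitSoftware(100) gives 96, the same result blsr produces for the constant used in Main.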

The important thing to know here: the JIT can detect whether the processor supports the instruction set queried through the IsSupported property, and it removes the whole condition, leaving only the branch that applies. On my CPU, the program will always execute ResetLowestSetBit without checking each time whether the instruction set is available (that would be pointless, since the processor won’t change while the program is running). So what about inlining? In theory, it should behave the same as in the first example, because the compiled body of this method will be the same (a single blsr instruction). Is it though?

; Core CLR v5.0.220.61120 on amd64

PopLsbTest.Program..ctor()
    L0000: ret

PopLsbTest.Program.Main()
    L0000: push rsi
    L0001: sub rsp, 0x20
    L0005: mov rcx, 0x7ff91c51cb60
    L000f: xor edx, edx
    L0011: call 0x00007ff9731fa420
    L0016: mov rsi, rax
    L0019: mov ecx, 0x64
    L001e: call PopLsbTest.Program.PopLsb2(UInt64)
    L0023: mov [rsi+8], rax
    L0027: add rsp, 0x20
    L002b: pop rsi
    L002c: ret

PopLsbTest.Program.PopLsb2(UInt64)
    L0000: blsr rax, rcx
    L0005: ret

And here we have a problem: the JIT simplified the PopLsb2 method correctly, but for some reason it was not inlined, so we see a call in Main. I was quite concerned when I discovered this some time ago, because it meant that my chess engine was making millions of pointless calls per second instead of executing the instruction directly. Fortunately, this can be easily fixed by forcing inlining with the [MethodImpl(MethodImplOptions.AggressiveInlining)] attribute, and that’s what I did in my engine - it gave a performance improvement of a few percent, so a big win.
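The fix could look roughly like the sketch below. It is not the engine’s actual code: the BitOps class and the PopLsb3 name are made up for illustration, and checking Bmi1.X64.IsSupported (the nested class whose method is actually called) is an assumption of this sketch rather than something taken from the listings above.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics.X86;

namespace PopLsbTest
{
    static class BitOps
    {
        // AggressiveInlining asks the JIT to inline this method at call sites
        // even when its own heuristics would decide against it.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static ulong PopLsb3(ulong value)
        {
            if (Bmi1.X64.IsSupported)
            {
                // Hardware path: compiles down to a single blsr instruction.
                return Bmi1.X64.ResetLowestSetBit(value);
            }

            // Software fallback for CPUs without BMI1.
            return value & (value - 1);
        }
    }
}
```

The attribute is a strong hint rather than a guarantee, so it is still worth checking the generated assembly for the hot call sites.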

Explanation

Intrigued by this bug, I opened an issue (Weird inlining of methods optimized by JIT) to see what developers more experienced with .NET internals could say about it. They wrote excellent replies with detailed descriptions of JIT internals and linked to even more discussions (AggressiveInlining not respected when Method grow too large and JIT: allow some aggressive inlines to go over budget), so I’m grateful. I don’t feel that rephrasing them here would add anything new, so I suggest reading these topics to get a better understanding of what happened.

Summary

I found this issue using an excellent tool, https://sharplab.io/, which can display the assembly code generated by the JIT (with all optimizations applied) - I can recommend it if you’re micro-optimizing your application. As we can read in the replies to the issue, .NET still has a few areas that will hopefully be improved in the future to produce even faster code.