For the last few months, I have been focused on developing the Cosette chess engine, toying with a very wide spectrum of algorithms and language optimizations. The latter was especially interesting because of the .NET 5 release, which brings a few interesting speed improvements (I will try to write something about them in the context of the engine soon). In this article, I will show an interesting inlining behavior of a method wrapping an intrinsic function, which in one specific case led to a performance drop in the chess engine.
Let’s say we want to write a method that is basically a wrapper for some intrinsic function, ResetLowestSetBit for example. It lets us pass a number to the processor and get back a new one with the lowest set bit cleared to zero. It may look like this:
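A minimal sketch of such a wrapper, assuming a 64-bit value and the Bmi1.X64 variant of the intrinsic (the class layout and the initial value of x are illustrative choices, not taken from the engine):

```csharp
using System;
using System.Runtime.Intrinsics.X86;

public static class Program
{
    // Static field (illustrative value) so the call cannot be folded away as a constant.
    private static ulong x = 0b1011_0000;

    public static void Main()
    {
        x = PopLsb1(x);
        Console.WriteLine(x);
    }

    // Thin wrapper around the BMI1 intrinsic.
    // Note: this unguarded version throws PlatformNotSupportedException
    // on hardware without BMI1 support.
    public static ulong PopLsb1(ulong value)
    {
        return Bmi1.X64.ResetLowestSetBit(value);
    }
}
```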
We marked our x variable as static (instead of making it a simple local inside Main) to prevent the compiler from removing our PopLsb1 call. Because this method is so short, the JIT is highly likely to inline it, so that the instruction (blsr in this case) is executed directly in Main, and that’s exactly what happens:
Although I couldn’t find out what the first call refers to, we can safely assume it is something internal generated by the JIT, so it’s not going to be interesting for us. The more interesting instruction is our blsr, which has been inlined by the JIT, so the program won’t pay any overhead for the method call and its stack operations.
Now let’s assume that our program can potentially run on processors that don’t support the BMI1 instruction set, to which blsr belongs. The software fallback for this operation is x & (x - 1), so let’s add a simple condition.
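A minimal sketch of the guarded version, again assuming the Bmi1.X64 variant and an illustrative class layout. The fallback works because subtracting 1 flips the lowest set bit and all zero bits below it, so ANDing with the original value clears exactly that bit (for example, 0b1011_0000 & 0b1010_1111 gives 0b1010_0000):

```csharp
using System;
using System.Runtime.Intrinsics.X86;

public static class Program
{
    // Static field (illustrative value) so the call cannot be folded away as a constant.
    private static ulong x = 0b1011_0000;

    public static void Main()
    {
        x = PopLsb2(x);
        Console.WriteLine(x);
    }

    public static ulong PopLsb2(ulong value)
    {
        // Hardware path: a single blsr instruction when BMI1 is available.
        if (Bmi1.X64.IsSupported)
        {
            return Bmi1.X64.ResetLowestSetBit(value);
        }

        // Software fallback: clears the lowest set bit (and leaves 0 as 0).
        return value & (value - 1);
    }
}
```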
The important thing to know here: the JIT can detect at compile time whether the processor supports the instruction set queried by the IsSupported property, and it removes the whole condition, leaving only the relevant branch. On my CPU, the program will always execute ResetLowestSetBit without checking every time whether the instruction set is available (that would be pointless, since we don’t expect the processor to change while the program is running). So what about inlining? In theory, it should behave the same as in the first example, because the body of this method ends up the same (a simple blsr instruction). Is it though?
And here we have a problem: the JIT simplified the PopLsb2 method correctly, but for some reason it is not inlined, so we see a call in Main. I was quite concerned when I discovered this some time ago, because it meant that my chess engine was making millions of pointless calls per second instead of executing the assembly instruction directly. Fortunately, this can be easily fixed by forcing inlining with the [MethodImpl(MethodImplOptions.AggressiveInlining)] attribute, and that’s what I did in my engine. It gave a nice performance improvement of a few percent, so a big win.
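As a sketch of that fix, assuming the same guarded wrapper shape as before (the class layout is illustrative and not the engine’s actual code), the attribute goes directly on the method:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics.X86;

public static class Program
{
    // Static field (illustrative value) so the call cannot be folded away as a constant.
    private static ulong x = 0b1011_0000;

    public static void Main()
    {
        x = PopLsb2(x);
        Console.WriteLine(x);
    }

    // Asks the JIT to inline this method even when its heuristics would not.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static ulong PopLsb2(ulong value)
    {
        if (Bmi1.X64.IsSupported)
        {
            return Bmi1.X64.ResetLowestSetBit(value);
        }

        // Software fallback for processors without BMI1.
        return value & (value - 1);
    }
}
```

The attribute is a strong hint rather than a guarantee, so it is worth checking the generated assembly afterwards to confirm that the call is really gone.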
Intrigued by this bug, I opened an issue (Weird inlining of methods optimized by JIT) to see what more experienced developers involved in .NET development would say about it. They wrote excellent replies with detailed descriptions of JIT internals and linked to even more posts about it (AggressiveInlining not respected when Method grow too large and JIT: allow some aggressive inlines to go over budget), so I’m grateful. I don’t feel that rephrasing them here would add anything new, so I suggest reading these threads to get a better understanding of what happened.
I found this issue using an excellent tool, https://sharplab.io/, which can display the assembly code generated by the JIT (with all optimizations). I can recommend it if you’re micro-optimizing your application. As we can read in the replies to the issue, .NET still has a few areas that will hopefully be improved in the future to produce even faster assembly.