Performance of Bit Manipulation Instructions (BMI)
Bit Manipulation Instructions (BMI) is an interesting extension for the x8664 architecture, introduced by Intel in Haswell processors (early 2010s). Its main purpose is, as the name suggests, increasing the speed of the most common bit operations by replacing manual calculation with dedicated instructions (which means hardware support). This article will focus on the performance of three example instructions: BLSI (reads the lowest bit), BLSR (resets the lowest bit) and TZCNT (counts the number of trailing nonset bits).
Theory
The first time I met with Bit Manipulation Instructions was while developing a chess engine Proxima b 2.0 for my thesis. It is important to know, that nearly every modern engine represents the chessboard in memory as bitboards  a collection of unsigned 64bit variables, where every bit represents one field. Thanks to it, a lot of things (like generating moves or checking which pieces are currently attacked) can be done fast using bit operations. The correct implementation is crucial for whole engine performance, and that’s where hardware support for more complex expressions can be really useful. BMI isn’t one solid pack of instructions  there are two main sets (which are identified separately by CPUID instruction):
 BMI1 contains 6 new instructions to resetting, extracting, counting and comparing
Instruction name  Description 

ANDN  Performs logical AND with negated X and Y (~x & y ) 
BEXTR  Reads n bits starting from the specified index 
BLSI  Reads the lowest bit (x & ~x ) 
BLSMSK  Creates mask for all bits up to the lowest bit (x ^ (x  1) ) 
BLSR  Resets the lowest bit (x & (x  1) ) 
TZCNT  Calculates the number of trailing nonset bits 
 BMI2 contains 8 new instructions mainly for shifting, rotating and parallel operations
Instruction name  Description 

BZHI  Zeros all bits from the specified index 
MULX  Performs unsigned multiplication without affecting flags 
PDEP  Stores bits using the mask 
PEXT  Extracts bits using the mask 
RORX  Rotates logical right without affecting flags 
SARX  Performs arithmetic right shift without affecting flags 
SHRX  Performs logical right shift without affecting flags 
SHLC  Performs logical left shift without affecting flags 
What’s more, AMD created its own sets:
 ABM (Advanced Bit Manipulation) with two instructions: POPCNT (calculates the number of set bits) and LZCNT (calculates the number of leading nonset bits).
 TBM (Trailing Bit Manipulation)  a set of 10 instructions whose goal was to refill the original BMI set. You can read more about them in the AMD documentation.
As you can see, there are many ways to operate on bits using hardware support. In the next chapter, we will focus on the performance of three instructions: BLSI, BLSR, TZCNT. First, we will write a program that performs operations using traditional mathematical formulas and then compares their performance with BMI instructions described above.
Implementation
Each test will be implemented in Rust and benchmarked by criterion.rs framework.
BLSI
At first, we will test BLSI instruction, which reads the lowest set bit and returns it. The equivalent formula for this operation is x & x
 ANDing variable with its negation, which works thanks to U2 (two’s complement) system, where negating a number is just an inversion of all bits, and then adding 1. Let’s see at the example for 8bit number 108:
x = 01101100
x = 10010100
01101100
AND 10010100

00000100
Having this knowledge, let’s see at the example implementation in Rust.


Assembly code generated by the compiler is straightforward  we store x
in rax
register, negate it and perform logical AND with x
.


Now let’s check what we will see after using BLSI instruction. To enable BMI1 instruction set we have to set environmental variable RUSTFLAGS (RUSTFLAGS = "C targetfeature=+bmi1
) and recompile project  Rust compiler should detect and optimize the formula to blsi
without any additional action.
Now the assembly code. As expected, a few instructions has been replaced by single blsi
call with rcx
as the source, and rax
as the destination register.


Benchmark results:
Lower bound  Estimate  Upper bound  

Manual  356.28 ps  356.69 ps  357.16 ps 
BMI1  356.22 ps  356.97 ps  357.87 ps 
BLSR
BLSR instruction resets the lowest set bit and returns the new value. The equivalent formula for this operation is x & (x  1)
 ANDing variable with its copy decreased by 1. Let’s see at the example for 8bit number 108 (similar as in previous chapter):
x = 01101100
x  1 = 01101011
01101100
AND 01101011

01101000


Assembly code generated by the compiler is also pretty simple  lea
instruction decrements rcx
register (where the iterator is stored), and stores result in the rax
register. Both rax
and rcx
are then ANDed and the result is stored to the rax
register.


With BMI1 enabled, both instructions are merged into single blsi
with the result stored in the rax
register.


Benchmark results:
Lower bound  Estimate  Upper bound  

Manual  356.12 ps  357.31 ps  359.10 ps 
BMI1  356.65 ps  357.58 ps  358.92 ps 
TZCNT
The last tested instruction, TZCNT, counts trailing nonset bits. The algorithm is based on De Bruijn multiplication which is considered as one of the fastest alternative to tzcnt
.


The output assembly is a bit more complex, but the general idea is the same as for previous examples. You can say only by looking at this, that there are a lot of instructions, which will for sure consume processor time during computation.


Now let’s look at the intrinsic function. The BMI1 equivalent of the manual counting zeros can be called by trailing_zeros()
.




Benchmark results:
Lower bound  Estimate  Upper bound  

Manual  532.50 ps  533.43 ps  534.23 ps 
BMI1  330.40 ps  331.52 ps  332.93 ps 
Comparison
All benchmark results have been compiled in the following table. It clearly suggests that using blsi
and blsr
instructions didn’t bring any noticeable performance gain  manual calculation was fast enough. Using tzcnt
was much more successful since De Bruijn multiplication involved much more assembly that could be replaced by just one instruction.
Estimate (Manual)  Estimate (BMI1)  Difference  

BLSI  356.69  356.97  0,0% 
BLSR  357.31 ps  357.58 ps  0,0% 
TZCNT  533.43 ps  331.52 ps  62,1% 
A few words about C#
Until recently, all responsibility for generating machine code was on the JIT (Justintime). It means you couldn’t tell the compiler that you want to use some specific instruction, eg. BLSI. But according to the recent news, .NET Core 3 adds an amazing feature that allows using intrinsic functions directly in the C# code. You can find them in the System.Runtime.Intrinsics.X86 namespace  don’t be suggested by the “X86” prefix, there are also versions for 64bit functions but for some reason, they are inside X86 namespace.
 BLSI  System.Runtime.Intrinsics.X86.Bmi1.X64.ExtractLowestSetBit
 BLSR  System.Runtime.Intrinsics.X86.Bmi1.X64.ResetLowestSetBit
 TZCNT  System.Numerics.BitOperations.TrailingZeroCount
I made a good use of them while developing Cosette, example can be found on the project repository in BitOperations.cs. Calls to X.IsSupported
do not have any additional overhead, since JIT correctly understands the platform on which a program runs on, so the unnecessary parts of the functions are cut off.