Bit Manipulation Instructions (BMI) is an interesting extension for the x86-64 architecture, introduced by Intel in Haswell processors (early 2010s). Its main purpose is, as the name suggests, increasing the speed of the most common bit operations by replacing manual calculation with dedicated instructions (which means hardware support). This article will focus on the performance of three example instructions: BLSI (reads the lowest bit), BLSR (resets the lowest bit) and TZCNT (counts the number of trailing non-set bits).
The first time I met with Bit Manipulation Instructions was while developing a chess engine Proxima b 2.0 for my thesis. It is important to know, that nearly every modern engine represents the chessboard in memory as bitboards - a collection of unsigned 64-bit variables, where every bit represents one field. Thanks to it, a lot of things (like generating moves or checking which pieces are currently attacked) can be done fast using bit operations. The correct implementation is crucial for whole engine performance, and that’s where hardware support for more complex expressions can be really useful. BMI isn’t one solid pack of instructions - there are two main sets (which are identified separately by CPUID instruction):
Instruction name | Description |
---|---|
ANDN | Performs logical AND with negated X and Y (~x & y ) |
BEXTR | Reads n bits starting from the specified index |
BLSI | Reads the lowest bit (x & ~x ) |
BLSMSK | Creates mask for all bits up to the lowest bit (x ^ (x - 1) ) |
BLSR | Resets the lowest bit (x & (x - 1) ) |
TZCNT | Calculates the number of trailing non-set bits |
Instruction name | Description |
---|---|
BZHI | Zeros all bits from the specified index |
MULX | Performs unsigned multiplication without affecting flags |
PDEP | Stores bits using the mask |
PEXT | Extracts bits using the mask |
RORX | Rotates logical right without affecting flags |
SARX | Performs arithmetic right shift without affecting flags |
SHRX | Performs logical right shift without affecting flags |
SHLC | Performs logical left shift without affecting flags |
What’s more, AMD created its own sets:
As you can see, there are many ways to operate on bits using hardware support. In the next chapter, we will focus on the performance of three instructions: BLSI, BLSR, TZCNT. First, we will write a program that performs operations using traditional mathematical formulas and then compares their performance with BMI instructions described above.
Each test will be implemented in Rust and benchmarked by criterion.rs framework.
At first, we will test BLSI instruction, which reads the lowest set bit and returns it. The equivalent formula for this operation is x & -x
- ANDing variable with its negation, which works thanks to U2 (two’s complement) system, where negating a number is just an inversion of all bits, and then adding 1. Let’s see at the example for 8-bit number 108:
x = 01101100
-x = 10010100
01101100
AND 10010100
------------
00000100
Having this knowledge, let’s see at the example implementation in Rust.
|
|
Assembly code generated by the compiler is straightforward - we store x
in rax
register, negate it and perform logical AND with x
.
|
|
Now let’s check what we will see after using BLSI instruction. To enable BMI1 instruction set we have to set environmental variable RUSTFLAGS (RUSTFLAGS = "-C target-feature=+bmi1
) and recompile project - Rust compiler should detect and optimize the formula to blsi
without any additional action.
Now the assembly code. As expected, a few instructions has been replaced by single blsi
call with rcx
as the source, and rax
as the destination register.
|
|
Benchmark results:
Lower bound | Estimate | Upper bound | |
---|---|---|---|
Manual | 356.28 ps | 356.69 ps | 357.16 ps |
BMI1 | 356.22 ps | 356.97 ps | 357.87 ps |
BLSR instruction resets the lowest set bit and returns the new value. The equivalent formula for this operation is x & (x - 1)
- ANDing variable with its copy decreased by 1. Let’s see at the example for 8-bit number 108 (similar as in previous chapter):
x = 01101100
x - 1 = 01101011
01101100
AND 01101011
------------
01101000
|
|
Assembly code generated by the compiler is also pretty simple - lea
instruction decrements rcx
register (where the iterator is stored), and stores result in the rax
register. Both rax
and rcx
are then ANDed and the result is stored to the rax
register.
|
|
With BMI1 enabled, both instructions are merged into single blsi
with the result stored in the rax
register.
|
|
Benchmark results:
Lower bound | Estimate | Upper bound | |
---|---|---|---|
Manual | 356.12 ps | 357.31 ps | 359.10 ps |
BMI1 | 356.65 ps | 357.58 ps | 358.92 ps |
The last tested instruction, TZCNT, counts trailing non-set bits. The algorithm is based on De Bruijn multiplication which is considered as one of the fastest alternative to tzcnt
.
|
|
The output assembly is a bit more complex, but the general idea is the same as for previous examples. You can say only by looking at this, that there are a lot of instructions, which will for sure consume processor time during computation.
|
|
Now let’s look at the intrinsic function. The BMI1 equivalent of the manual counting zeros can be called by trailing_zeros()
.
|
|
|
|
Benchmark results:
Lower bound | Estimate | Upper bound | |
---|---|---|---|
Manual | 532.50 ps | 533.43 ps | 534.23 ps |
BMI1 | 330.40 ps | 331.52 ps | 332.93 ps |
All benchmark results have been compiled in the following table. It clearly suggests that using blsi
and blsr
instructions didn’t bring any noticeable performance gain - manual calculation was fast enough. Using tzcnt
was much more successful since De Bruijn multiplication involved much more assembly that could be replaced by just one instruction.
Estimate (Manual) | Estimate (BMI1) | Difference | |
---|---|---|---|
BLSI | 356.69 | 356.97 | 0,0% |
BLSR | 357.31 ps | 357.58 ps | 0,0% |
TZCNT | 533.43 ps | 331.52 ps | 62,1% |
Until recently, all responsibility for generating machine code was on the JIT (Just-in-time). It means you couldn’t tell the compiler that you want to use some specific instruction, eg. BLSI. But according to the recent news, .NET Core 3 adds an amazing feature that allows using intrinsic functions directly in the C# code. You can find them in the System.Runtime.Intrinsics.X86 namespace - don’t be suggested by the “X86” prefix, there are also versions for 64-bit functions but for some reason, they are inside X86 namespace.
I made a good use of them while developing Cosette, example can be found on the project repository in BitOperations.cs. Calls to X.IsSupported
do not have any additional overhead, since JIT correctly understands the platform on which a program runs on, so the unnecessary parts of the functions are cut off.