Performance of Bit Manipulation Instructions (BMI)

2024-05-08

Page content

Bit Manipulation Instructions (BMI) is an interesting extension for the x86-64 architecture, introduced by Intel in Haswell processors (early 2010s). Its main purpose is, as the name suggests, increasing the speed of the most common bit operations by replacing manual calculation with dedicated instructions (which means hardware support). This article will focus on the performance of three example instructions: BLSI (reads the lowest bit), BLSR (resets the lowest bit) and TZCNT (counts the number of trailing non-set bits).

Theory

The first time I met with Bit Manipulation Instructions was while developing a chess engine Proxima b 2.0 for my thesis. It is important to know, that nearly every modern engine represents the chessboard in memory as bitboards - a collection of unsigned 64-bit variables, where every bit represents one field. Thanks to it, a lot of things (like generating moves or checking which pieces are currently attacked) can be done fast using bit operations. The correct implementation is crucial for whole engine performance, and that’s where hardware support for more complex expressions can be really useful. BMI isn’t one solid pack of instructions - there are two main sets (which are identified separately by CPUID instruction):

BMI1 contains 6 new instructions to resetting, extracting, counting and comparing

Instruction name	Description
ANDN	Performs logical AND with negated X and Y (`~x & y`)
BEXTR	Reads n bits starting from the specified index
BLSI	Reads the lowest bit (`x & ~x`)
BLSMSK	Creates mask for all bits up to the lowest bit (`x ^ (x - 1)`)
BLSR	Resets the lowest bit (`x & (x - 1)`)
TZCNT	Calculates the number of trailing non-set bits

BMI2 contains 8 new instructions mainly for shifting, rotating and parallel operations

Instruction name	Description
BZHI	Zeros all bits from the specified index
MULX	Performs unsigned multiplication without affecting flags
PDEP	Stores bits using the mask
PEXT	Extracts bits using the mask
RORX	Rotates logical right without affecting flags
SARX	Performs arithmetic right shift without affecting flags
SHRX	Performs logical right shift without affecting flags
SHLC	Performs logical left shift without affecting flags

What’s more, AMD created its own sets:

ABM (Advanced Bit Manipulation) with two instructions: POPCNT (calculates the number of set bits) and LZCNT (calculates the number of leading non-set bits).
TBM (Trailing Bit Manipulation) - a set of 10 instructions whose goal was to refill the original BMI set. You can read more about them in the AMD documentation.

As you can see, there are many ways to operate on bits using hardware support. In the next chapter, we will focus on the performance of three instructions: BLSI, BLSR, TZCNT. First, we will write a program that performs operations using traditional mathematical formulas and then compares their performance with BMI instructions described above.

Implementation

Each test will be implemented in Rust and benchmarked by criterion.rs framework.

BLSI

At first, we will test BLSI instruction, which reads the lowest set bit and returns it. The equivalent formula for this operation is x & -x - ANDing variable with its negation, which works thanks to U2 (two’s complement) system, where negating a number is just an inversion of all bits, and then adding 1. Let’s see at the example for 8-bit number 108:

    x = 01101100
   -x = 10010100

        01101100
    AND 10010100
    ------------
        00000100

Having this knowledge, let’s see at the example implementation in Rust.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
use criterion::black_box;
use criterion::criterion_group;
use criterion::criterion_main;
use criterion::Criterion;

fn benchmark(criterion: &mut Criterion) {
    criterion.bench_function("get_lsb", |bencher| bencher.iter(|| get_lsb(black_box(45631345))));
}

fn get_lsb(x: u64) -> u64 {
    x & x.wrapping_neg()
}

criterion_group!(benches, benchmark);
criterion_main!(benches);

Assembly code generated by the compiler is straightforward - we store x in rax register, negate it and perform logical AND with x.

1
2
3
4
mov rax, rcx
neg rax
and rax, rcx
ret

Now let’s check what we will see after using BLSI instruction. To enable BMI1 instruction set we have to set environmental variable RUSTFLAGS (RUSTFLAGS = "-C target-feature=+bmi1) and recompile project - Rust compiler should detect and optimize the formula to blsi without any additional action.

Now the assembly code. As expected, a few instructions has been replaced by single blsi call with rcx as the source, and rax as the destination register.

1
2
blsi rax, rcx
ret

Benchmark results:

	Lower bound	Estimate	Upper bound
Manual	356.28 ps	356.69 ps	357.16 ps
BMI1	356.22 ps	356.97 ps	357.87 ps

BLSR

BLSR instruction resets the lowest set bit and returns the new value. The equivalent formula for this operation is x & (x - 1) - ANDing variable with its copy decreased by 1. Let’s see at the example for 8-bit number 108 (similar as in previous chapter):

    x = 01101100
x - 1 = 01101011

        01101100
    AND 01101011
    ------------
        01101000

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
use criterion::black_box;
use criterion::criterion_group;
use criterion::criterion_main;
use criterion::Criterion;

fn benchmark(criterion: &mut Criterion) {
    criterion.bench_function("pop_lsb", |bencher| bencher.iter(|| get_lsb(black_box(45631345))));
}

fn pop_lsb(x: u64) -> u64 {
    x & (x - 1)
}

criterion_group!(benches, benchmark);
criterion_main!(benches);

Assembly code generated by the compiler is also pretty simple - lea instruction decrements rcx register (where the iterator is stored), and stores result in the rax register. Both rax and rcx are then ANDed and the result is stored to the rax register.

1
2
3
lea rax, [rcx - 1]
and rax, rcx
ret

With BMI1 enabled, both instructions are merged into single blsi with the result stored in the rax register.

1
2
blsi rax, rcx
ret

Benchmark results:

	Lower bound	Estimate	Upper bound
Manual	356.12 ps	357.31 ps	359.10 ps
BMI1	356.65 ps	357.58 ps	358.92 ps

TZCNT

The last tested instruction, TZCNT, counts trailing non-set bits. The algorithm is based on De Bruijn multiplication which is considered as one of the fastest alternative to tzcnt.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
use criterion::black_box;
use criterion::criterion_group;
use criterion::criterion_main;
use criterion::Criterion;

fn benchmark(criterion: &mut Criterion) {
    criterion.bench_function("bit_scan", |bencher| bencher.iter(|| bit_scan(black_box(45631345))));
}

fn bit_scan(x: u64) -> u64 {
    #[rustfmt::skip]
    const INDEX64: [u32; 64] = [
        0,  1,  48,  2, 57, 49, 28,  3,
        61, 58, 50, 42, 38, 29, 17,  4,
        62, 55, 59, 36, 53, 51, 43, 22,
        45, 39, 33, 30, 24, 18, 12,  5,
        63, 47, 56, 27, 60, 41, 37, 16,
        54, 35, 52, 21, 44, 32, 23, 11,
        46, 26, 40, 15, 34, 20, 31, 10,
        25, 14, 19,  9, 13,  8,  7,  6
    ];

    INDEX64[(((x & x.wrapping_neg()).wrapping_mul(0x03f79d71b4cb0a89)) >> 58) as usize]
}

criterion_group!(benches, benchmark);
criterion_main!(benches);

The output assembly is a bit more complex, but the general idea is the same as for previous examples. You can say only by looking at this, that there are a lot of instructions, which will for sure consume processor time during computation.

1
2
3
4
5
6
7
8
9
mov rax, rcx
neg rax
and rax, rcx
movabs rcx, 285870213051386505
imul rcx, rax
shr rcx, 58
lea rax, [rip + anon.9e0e815cd4fdb58dcb37887dde8870f8.1020]
mov eax, dword ptr [rax + 4*rcx]
ret

Now let’s look at the intrinsic function. The BMI1 equivalent of the manual counting zeros can be called by trailing_zeros().

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
use criterion::black_box;
use criterion::criterion_group;
use criterion::criterion_main;
use criterion::Criterion;

fn benchmark(criterion: &mut Criterion) {
    criterion.bench_function("bit_scan", |bencher| bencher.iter(|| bit_scan(black_box(45631345))));
}

fn bit_scan(x: u64) -> u64 {
    x.trailing_zeros()
}

criterion_group!(benches, benchmark);
criterion_main!(benches);

1
2
tzcnt rax, rcx
ret

Benchmark results:

	Lower bound	Estimate	Upper bound
Manual	532.50 ps	533.43 ps	534.23 ps
BMI1	330.40 ps	331.52 ps	332.93 ps

Comparison

All benchmark results have been compiled in the following table. It clearly suggests that using blsi and blsr instructions didn’t bring any noticeable performance gain - manual calculation was fast enough. Using tzcnt was much more successful since De Bruijn multiplication involved much more assembly that could be replaced by just one instruction.

	Estimate (Manual)	Estimate (BMI1)	Difference
BLSI	356.69	356.97	0,0%
BLSR	357.31 ps	357.58 ps	0,0%
TZCNT	533.43 ps	331.52 ps	62,1%

A few words about C#

Until recently, all responsibility for generating machine code was on the JIT (Just-in-time). It means you couldn’t tell the compiler that you want to use some specific instruction, eg. BLSI. But according to the recent news, .NET Core 3 adds an amazing feature that allows using intrinsic functions directly in the C# code. You can find them in the System.Runtime.Intrinsics.X86 namespace - don’t be suggested by the “X86” prefix, there are also versions for 64-bit functions but for some reason, they are inside X86 namespace.

BLSI - System.Runtime.Intrinsics.X86.Bmi1.X64.ExtractLowestSetBit
BLSR - System.Runtime.Intrinsics.X86.Bmi1.X64.ResetLowestSetBit
TZCNT - System.Numerics.BitOperations.TrailingZeroCount

I made a good use of them while developing Cosette, example can be found on the project repository in BitOperations.cs. Calls to X.IsSupported do not have any additional overhead, since JIT correctly understands the platform on which a program runs on, so the unnecessary parts of the functions are cut off.