Understanding compiler optimizations with assembly
Some bioinformatics engineers spend a lot of time and energy making their tools more efficient. When that time is spent designing the overall algorithms and data structures, it can be extremely advantageous. When it is spent on low-level optimizations of single operations, however, the compiler will usually perform those optimizations itself, and the programmer should instead focus on making the code readable and easier to maintain.
As a bioinformatician without formal training in computer science, there is a lot that I don't know about compilers and optimization, but here is a nonexhaustive list of some key optimizations, along with the assembly they produce, that I find particularly helpful to understand when creating efficient software. The beauty of it is that, apart from the design considerations, many of these are performed by the compiler simply by enabling optimizations.
Inspecting compiler output
The best way to understand what is happening in your code is to look at the assembly. There are good, readily available tools for understanding compiler output:
- Compile with `gcc -fopt-info` to get a report of the optimizations that were applied
- godbolt.org (Compiler Explorer) for interactive inspection
- `objdump -d -M intel <binary>` to disassemble an existing binary
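For example, a typical workflow with GCC on x86-64 might look like this (the file names are placeholders):

# Emit Intel-syntax assembly for inspection
gcc -O2 -S -masm=intel example.c -o example.s

# Or compile and disassemble the resulting binary
gcc -O2 example.c -o example
objdump -d -M intel example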
General considerations
It is far more important to design software that is algorithmically efficient. In bioinformatics it is extremely common for this to get overlooked, and users end up throwing compute at the problem. Do you really need 40 CPUs and 500 GB of RAM? Sometimes, but not usually. If you do, you need to look at your design and memory usage first; the optimizations below will not help. They are just for eking out the last bit of performance and, more than anything, they are helpful for understanding what is actually going on in your code.
Inlining
Function inlining is a simple optimization that replaces the function call with the body of the function itself. This can certainly save some time, but at the expense of a larger binary. In some cases, the compiler eliminates the function calls and the stack operations that are associated with them. For example:
int square(int x) {
    return x * x;
}

int add_and_square(int x, int y) {
    return square(x) + y;
}
If we compile these without optimizations enabled (so nothing gets inlined), we get the following assembly:
square:
    push rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    imul eax, eax
    pop rbp
    ret

add_and_square:
    push rbp
    mov rbp, rsp
    sub rsp, 8
    mov DWORD PTR [rbp-4], edi
    mov DWORD PTR [rbp-8], esi
    mov eax, DWORD PTR [rbp-4]
    mov edi, eax
    call square
    mov edx, DWORD PTR [rbp-8]
    add eax, edx
    leave
    ret
In both functions, the first instruction is `push rbp`, which saves the caller's base pointer, followed by `mov rbp, rsp`, which creates a new stack frame for the callee. After the body of the function, we restore the caller's base pointer and return. That is at least 4 instructions just for calling the function.

In the function `add_and_square` we are calling the function `square`. If we let the compiler optimize, with `-O1` or above, we will see something like this:
square:
    imul edi, edi
    mov eax, edi
    ret

add_and_square:
    imul edi, edi
    lea eax, [rdi+rsi]
    ret
First, there is no stack setup or teardown for our functions in either case, but that is out of scope for this example. The important detail is that the compiler has automatically inlined `square`, replacing the call to it with the instruction `imul edi, edi`, which takes the first argument and multiplies it by itself.
While the savings aren't necessarily huge, one can easily imagine the benefit when small functions are called many times. Of course, a programmer could do some amount of inlining themselves, but at the expense of readability and maintainability.
The `inline` keyword is a suggestion and doesn't guarantee that your function will be inlined. Besides looking at the assembly dump, one way to check is to compile with `-fopt-info` and inspect the output.
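As a minimal sketch (the function names here are just for illustration), `inline` is only a hint, while GCC's `always_inline` attribute actually forces the issue:

// `inline` is a hint; with optimizations enabled, GCC will usually
// inline a function this small whether or not the keyword is present.
static inline int square_hint(int x) {
    return x * x;
}

// GCC-specific: inline this function even when the heuristics say no.
// Use sparingly; the compiler's judgment is usually right.
__attribute__((always_inline)) static inline int square_forced(int x) {
    return x * x;
}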
Automatic vectorization of loops
Oftentimes, the compiler will attempt to recognize loops that can be vectorized and compile them to SIMD instructions using vector registers. This can give a huge performance boost.
An example of a loop that can be auto-vectorized:
void auto_vec(int* c, int* b, int* a, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = b[i] + a[i];
    }
}
If we compile this with `-O2`, we get a simple scalar loop:
auto_vec:
    mov r8, rdx
    test ecx, ecx
    jle .L1
    movsx rcx, ecx
    xor eax, eax
    sal rcx, 2
.L3:
    mov edx, DWORD PTR [r8+rax]
    add edx, DWORD PTR [rsi+rax]
    mov DWORD PTR [rdi+rax], edx
    add rax, 4
    cmp rcx, rax
    jne .L3
.L1:
    ret
Initial checks are performed and then execution moves into the main body of the loop, `.L3`. We move `a[i]` into `edx`, add `b[i]` to `edx`, and move `edx` to `c[i]`. We then increment the offset (`add rax, 4`, since `rax` holds a byte offset and each int is 4 bytes) and check whether the loop is done.
If we compile with `gcc -O2 -ftree-vectorize` and look at the godbolt output (full output at the bottom of the page), we can see that there is a lot more going on. The scalar loop from above is still there (`.L3`, `.L9`) as a runtime fallback in case the loop can't be vectorized for any reason, and there are additional SIMD instructions with vector registers to optimize where we can.
.L5:
    movdqu xmm0, XMMWORD PTR [rsi+rax]
    movdqu xmm2, XMMWORD PTR [rdx+rax]
    paddd xmm0, xmm2
    movups XMMWORD PTR [rdi+rax], xmm0
    add rax, 16
    cmp r8, rax
    jne .L5
The `xmm0` and `xmm2` registers are 128 bits wide and are used in this case to hold 4 integers from `a` and `b` at a time. We add them with `paddd` and then store the result in `c`. In this example, each iteration of the loop processes 4 elements from `a` and `b` simultaneously.
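For intuition, here is roughly the same operation written by hand with SSE2 intrinsics (a sketch for illustration; the function name is mine). It assumes `n` is a multiple of 4 and, unlike the real compiler output, skips the remainder handling and aliasing checks visible in the full listing.

#include <emmintrin.h>

void manual_vec(int* c, int* b, int* a, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i*)&a[i]); // load 4 ints from a
        __m128i vb = _mm_loadu_si128((const __m128i*)&b[i]); // load 4 ints from b
        __m128i vc = _mm_add_epi32(va, vb);                  // paddd: 4 adds at once
        _mm_storeu_si128((__m128i*)&c[i], vc);               // store 4 ints to c
    }
}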
This can provide a significant performance boost. To check if your compiler is vectorizing your loops:
- Ensure that you are using the right flags. For gcc, at least `-O2` together with `-ftree-vectorize` (or `-O3`, which enables it)
- Inspect the output when compiling with `-fopt-info-vec -fopt-info-vec-missed` to get a detailed summary of what was vectorized, what was missed, and why
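One common reason for missed vectorization, and the reason for the runtime fallback in the full listing below, is that the compiler cannot prove the pointers don't overlap. A minimal sketch, assuming C99: the `restrict` qualifier promises the compiler that the arrays never alias, which can let it vectorize without those runtime overlap checks.

// restrict promises that c, b, and a never overlap, so the
// compiler doesn't need to emit runtime aliasing checks.
void auto_vec_restrict(int* restrict c, int* restrict b,
                       int* restrict a, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = b[i] + a[i];
    }
}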
Loop unrolling
Loop unrolling can provide a performance boost by processing the data in chunks and using fewer loop-related instructions. For example, we can (but shouldn't) manually unroll the loop below by using a larger step size:
void add_ints(int* x, int* y, int* z, int n) {
    for (int i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}

void unrolled_add_ints(int* x, int* y, int* z, int n) {
    int i;
    for (i = 0; i <= n - 4; i += 4) {
        z[i] = x[i] + y[i];
        z[i + 1] = x[i + 1] + y[i + 1];
        z[i + 2] = x[i + 2] + y[i + 2];
        z[i + 3] = x[i + 3] + y[i + 3];
    }
    // Handle the 0-3 leftover elements when n is not a multiple of 4
    for (; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}
If we inspect the assembly for each of these with `-O3 -funroll-loops`, we will see that the compiler optimizes the first with SIMD vectorization and unrolled loops: 4 elements of the arrays are added at a time, and the loop progresses in steps of 4. This can have a significant impact on execution time when dealing with a large amount of data. It is also strong evidence that the programmer should trust the compiler, as the manually unrolled loop is not pretty to read (and needs an extra cleanup loop for the leftover elements).
Integer swap without a temp variable
Sometimes what you think you are doing and what the compiler actually does are different. A classic example that illustrates this (and a common interview question) is the difference between using a third variable vs. XOR to swap two ints. The first example below uses a temp variable to swap the two; the second uses XOR.
void swap(int* a, int* b) {
    int c = *a;
    *a = *b;
    *b = c;
}

void xor_swap(int* a, int* b) {
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
To compare these two implementations of the same thing, we can look at the assembly generated with `-O1` or higher.
swap:
    mov eax, DWORD PTR [rdi]
    mov edx, DWORD PTR [rsi]
    mov DWORD PTR [rdi], edx
    mov DWORD PTR [rsi], eax
    ret

xor_swap:
    mov eax, DWORD PTR [rdi]
    xor eax, DWORD PTR [rsi]
    mov DWORD PTR [rdi], eax
    xor eax, DWORD PTR [rsi]
    mov DWORD PTR [rsi], eax
    xor DWORD PTR [rdi], eax
    ret
The compiler has optimized `swap` down to 4 instructions and `xor_swap` to 6. In `swap`, the compiler uses two registers (`eax` and `edx`) to load `a` and `b`, then stores the values from `eax` and `edx` to the memory pointed to by `a` and `b`. In `xor_swap`, we are using a single register and doing 6 operations involving memory reads and writes. While it might be a neat trick, the XOR swap is slower, fragile, much harder to read, and uses a register as a "third temp variable" anyway.
Complete auto_vec assembly
auto_vec:
    test ecx, ecx
    jle .L1
    cmp ecx, 1
    je .L3
    lea rax, [rdi-4]
    mov r8, rax
    sub r8, rsi
    cmp r8, 8
    jbe .L3
    sub rax, rdx
    cmp rax, 8
    jbe .L3
    lea eax, [rcx-1]
    mov r9d, ecx
    cmp eax, 2
    jbe .L11
    mov r8d, ecx
    xor eax, eax
    shr r8d, 2
    sal r8, 4
.L5:
    movdqu xmm0, XMMWORD PTR [rsi+rax]
    movdqu xmm2, XMMWORD PTR [rdx+rax]
    paddd xmm0, xmm2
    movups XMMWORD PTR [rdi+rax], xmm0
    add rax, 16
    cmp r8, rax
    jne .L5
    mov eax, ecx
    and eax, -4
    mov r8d, eax
    cmp ecx, eax
    je .L1
    sub ecx, eax
    mov r9d, ecx
    cmp ecx, 1
    je .L7
.L4:
    mov ecx, r8d
    movq xmm0, QWORD PTR [rsi+rcx*4]
    movq xmm1, QWORD PTR [rdx+rcx*4]
    paddd xmm0, xmm1
    movq QWORD PTR [rdi+rcx*4], xmm0
    test r9b, 1
    je .L1
    and r9d, -2
    add eax, r9d
.L7:
    cdqe
    mov edx, DWORD PTR [rdx+rax*4]
    add edx, DWORD PTR [rsi+rax*4]
    mov DWORD PTR [rdi+rax*4], edx
    ret
.L3:
    movsx rcx, ecx
    xor eax, eax
    sal rcx, 2
.L9:
    mov r8d, DWORD PTR [rdx+rax]
    add r8d, DWORD PTR [rsi+rax]
    mov DWORD PTR [rdi+rax], r8d
    add rax, 4
    cmp rax, rcx
    jne .L9
.L1:
    ret
.L11:
    xor r8d, r8d
    xor eax, eax
    jmp .L4