## Sunday, April 28, 2019

### How LLVM optimizes power sums

LLVM optimizes power sums, such as
int sum(int count)
{
int result = 0;

for (int j = 0; j < count; ++j)
result += j*j;

return result;
}

to code that calculates the result without a loop (godbolt)
sum(int):
test    edi, edi
jle     .LBB0_1
lea     eax, [rdi - 1]
lea     ecx, [rdi - 2]
imul    rcx, rax
lea     eax, [rdi - 3]
imul    rax, rcx
shr     rax
imul    eax, eax, 1431655766
shr     rcx
lea     ecx, [rcx + 2*rcx]
lea     eax, [rax + rcx]
ret
.LBB0_1:
xor     eax, eax
ret
It handles more complex cases too (godbolt) – that is, the optimization is not just a simple pattern matcher. This post will show how the optimization is done.

## Loop analysis – scalar evolution

There are many cases where compilers need to track how values are updated within a loop. For example, the loop vectorizer needs to check that the pointers are moved to the adjacent element in the next iteration, and check that no other pointer indexing may alias the range we are vectorizing.

Both GCC and LLVM does this in the same way in their scalar evolution passes, where each variable at iteration $$i$$ (we start enumerating iterations from $$0$$) is represented as a function $$f_0(i)$$ defined as a linear recurrence of the form
$f_j(i) = \begin{cases} \phi_j & \text{if i = 0} \\ f_j(i-1) \odot_{j+1} f_{j+1}(i-1) & \text{if i > 0} \end{cases}$ where $$\odot\in\{+,*\}$$.

### Example 1

The simplest case is a loop such as
void foo(int m, int *p)
{
for (int j = 0; j < m; j++)
*p++ = j;
}

The loop writes $$0$$ to *p++ in the first iteration, $$1$$ in the second, etc. So we can express the value written at iteration $$i$$ as$f(i) = \begin{cases} 0 & \text{if i = 0} \\ f(i-1) + 1 & \text{if i > 0} \end{cases}$

### Example 2

Polynomials in the iteration variable can also be expressed in this form.
void foo(int m, int k, int *p)
{
for (int j = 0; < m; j++)
*p++ = j*j*j - 2*j*j + k*j + 7;
}

We will see below how to build the functions, but the result for the value stored in this loop is \begin{align}f_2(i) & = \begin{cases} 2\phantom{f_0(i-1) + f_1(i-1)} & \text{if i = 0} \\ f_2(i-1) + 6 & \text{if i > 0} \end{cases}\\ f_1(i) & = \begin{cases} k-1 & \text{if i = 0} \\ f_1(i-1) + f_2(i-1)\phantom{2} & \text{if i > 0} \end{cases}\\ f(i) = f_0(i) & = \begin{cases} 7 & \text{if i = 0} \\ f_0(i-1) + f_1(i-1)\phantom{2} & \text{if i > 0} \end{cases}\end{align} One optimization we can see directly from these functions is that the value can be calculated by just three additions within the loop
void foo(int m, int k, int *p)
{
int t0 = 7;
int t1 = k-1;
int t2 = 2;
for (int j = 0; j < m; j++) {
*p++ = t0;
t0 = t0 + t1;
t1 = t1 + t2;
t2 = t2 + 6;
}
}

which is a useful optimization for architectures were multiplication is expensive. This kind of code is, however, uncommon, so most compilers do not do this optimzation – but they usually do this for simpler cases, such as
void foo(int m, int k, int *p)
{
for (int j = 0; < m; j++)
*p++ = k*j + 7;
}

as constructs of the form k*j+7 are common in address calculations.

## Chains of recurrences

It is cumbersome to write the recursive functions all the time, so the functions are usually written in the form $$\{\phi_j, \odot_{j+1}, f_{j+1}\}$$. For example \begin{align}f_2(i) & = \begin{cases} 2\phantom{f_0(i-1) + f_1(i-1)} & \text{if i = 0} \\ f_2(i-1) + 6 & \text{if i > 0} \end{cases} \phantom{xx}\text{is written as \{2,+,6\}}\\ f_1(i) & = \begin{cases} k-1 & \text{if i = 0} \\ f_1(i-1) + f_2(i-1)\phantom{2} & \text{if i > 0} \end{cases} \phantom{xx}\text{is written as \{k-1,+,f_2\}}\\ f(i) = f_0(i) & = \begin{cases} 7 & \text{if i = 0} \\ f_0(i-1) + f_1(i-1)\phantom{2} & \text{if i > 0} \end{cases} \phantom{xx}\text{is written as \{7,+,f_1\}}\end{align} These can be chained, so $$f(i)$$ can be written as a chain of recurrences (CR) $$\{7,+,\{k-1,+,\{2,+,6\}\}\}$$. The inner curly braces are redundant, so the CR is usually written as a tuple $$\{7,+,k-1,+,2,+,6\}$$.

## Building the chains of recurrences

The chains of recurrences are built by iterating over the code and calculating the CR for the result of each operation (or marking it as unknown if we cannot handle the operation), using simplification rules such as\begin{align}c * \{\phi_0, +, \phi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{c * \phi_0, +, c * \phi_1\} \\ \{\phi_0, +, \phi_1\} + \{\psi_0, +, \psi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0 + \psi_0, +, \phi_1 + \psi_1\} \\ \{\phi_0, +, \phi_1\}* \{\psi_0, +, \psi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0 * \psi_0, +, \psi_1 * \{\phi_0, +, \phi_1\} + \phi_1 * \{\psi_0, +, \psi_1\} + \phi_1*\psi_1\} \\ \{\phi_0, +, \phi_1,+,0\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0, +, \phi_1\}\end{align} So for the loop from the sum function
for (int j = 0; j < count; ++j)
result += j*j;

we start with j which we know has the CR $$\{0,+,1\}$$ per Example 1. This is then used as j*j when calculating result, so we calculate the CR for j*j using the simplification rules as \begin{align}j*j& = \{0,+,1\} * \{0,+,1\} \\ & = \{0 * 0, +, 1 * \{0, +,1\} + 1 * \{0, +, 1\} + 1*1\} \\ & = \{0, +, 1,+,2\}\end{align} Similar calculations for result gives us the CR $$\{0,+,0,+,1,+,2\}$$ for the value at the beginning of the iteration, and $$\{0,+,1,+,3,+,2\}$$ after adding j*j.

## Doing the optimization

The optimization is done during induction variable simplification, and LLVM has transformed the function to a form more convenient for analysis and optimization
int sum(int count)
{
int result = 0;

if (count > 0) {
int j = 0;
do {
result = result + j*j;
++j;
} while (j < count);
}

return result;
}

or as it looks like in the LLVM IR
define i32 @sum(i32) {
%2 = icmp sgt i32 %0, 0
br i1 %2, label %3, label %6

; <label>:3:
br label %8

; <label>:4:
%5 = phi i32 [ %12, %8 ]
br label %6

; <label>:6:
%7 = phi i32 [ 0, %1 ], [ %5, %4 ]
ret i32 %7

; <label>:8:
%9 = phi i32 [ %13, %8 ], [ 0, %3 ]     ; {0,+,1}
%10 = phi i32 [ %12, %8 ], [ 0, %3 ]    ; {0,+,0,+,1,+,2}
%11 = mul nsw i32 %9, %9                ; {0,+,1,+,2}
%12 = add nuw nsw i32 %11, %10          ; {0,+,1,+,3,+,2}
%13 = add nuw nsw i32 %9, 1             ; {1,+,1}
%14 = icmp slt i32 %13, %0
br i1 %14, label %8, label %4
}

The compiler can see that the function returns 0 if count <= 0, otherwise it returns the value of result at loop iteration count-1.

One nice property of the chains of recurrences is that it is easy to calculate the value at a specific iteration – if we have a CR $$\{\phi_0,+,\phi_1,+,\ldots,+,\phi_n\}$$, then the value at iteration $$i$$ can be calculated as \begin{align}f(i) & = \sum_{j=0}^{n}\phi_j{i \choose j} \\ & = \phi_0 + \phi_1i + \phi_2{i(i-1)\over 2!} + \ldots + \phi_n{i(i-1)\cdots(i-n+1)\over n!}\end{align} Inserting the values for the CR $$\{0,+,1,+,3,+,2\}$$ describing result gives us $f(i) = i + {3i(i-1)\over 2} + {i(i-1)(i-2) \over 3}$ The compiler now only need to insert code that calculates this with $$i =$$ count-1, after the loop
result = count-1 + 3*(count-1)*(count-2)/2 + (count-1)*(count-2)(count-3)/3;

but it need to take some care to calculate in the correct precision (as temporary values may not fit in 32-bit integers). Integer division is slow, so it is also doing tricks to replace the divisions with multiplication and shift instructions. The result is the LLVM IR
  %4 = add i32 %0, -1
%5 = zext i32 %4 to i33
%6 = add i32 %0, -2
%7 = zext i32 %6 to i33
%8 = mul i33 %5, %7
%9 = add i32 %0, -3
%10 = zext i32 %9 to i33
%11 = mul i33 %8, %10
%12 = lshr i33 %11, 1
%13 = trunc i33 %12 to i32
%14 = mul i32 %13, 1431655766
%15 = add i32 %14, %0
%16 = lshr i33 %8, 1
%17 = trunc i33 %16 to i32
%18 = mul i32 %17, 3
%19 = add i32 %15, %18
%20 = add i32 %19, -1

Inserting this makes the loop become dead, so it is later removed by dead code elimination, and we eventually end up with the code
sum(int):
test    edi, edi
jle     .LBB0_1
lea     eax, [rdi - 1]
lea     ecx, [rdi - 2]
imul    rcx, rax
lea     eax, [rdi - 3]
imul    rax, rcx
shr     rax
imul    eax, eax, 1431655766
shr     rcx
lea     ecx, [rcx + 2*rcx]
lea     eax, [rax + rcx]
ret
.LBB0_1:
xor     eax, eax
ret


## Performance

This optimization is not always profitable. For example,
int sum(int count)
{
int result = 0;

for (int j = 0; j < count; ++j)
result += j*j*j*j*j*j;

return result;
}
is calculated by three 32-bit multiplications and one addition within the loop, while the optimized version needs six 64-bit multiplications, five 32-bit multiplications, and a slew of other instructions (godbolt), so the optimized version is slower for small values of count. I benchmarked on my PC, and count must be larger than 5 for the optimized version to be faster than the loop. Smaller CPUs with, for example, more expensive 64-bit multiplication, will need a higher count for the optimization to help. And CPUs not having instructions for 64-bit multiplication (godbolt) will need a much higher count.

One problem with this optimization is that it is hard for developers to make the compiler generate a loop for this if they know that the majority of values used in reality are small enough for the loop to be the fastest option. GCC does, therefore, not replace the final value of a loop if the expression is expensive
/* Do not emit expensive expressions.  The rationale is that
when someone writes a code like

while (n > 45) n -= 45;

he probably knows that n is not large, and does not want it
to be turned into n %= 45.  */
|| expression_expensive_p (def))

So GCC not doing this optimization is a feature – not a bug.