LLVM optimizes power sums, such as
int sum(int count) { int result = 0; for (int j = 0; j < count; ++j) result += j*j; return result; }to code that calculates the result without a loop (godbolt)
sum(int): test edi, edi jle .LBB0_1 lea eax, [rdi - 1] lea ecx, [rdi - 2] imul rcx, rax lea eax, [rdi - 3] imul rax, rcx shr rax imul eax, eax, 1431655766 add eax, edi shr rcx lea ecx, [rcx + 2*rcx] lea eax, [rax + rcx] add eax, -1 ret .LBB0_1: xor eax, eax retIt handles more complex cases too (godbolt) – that is, the optimization is not just a simple pattern matcher. This post will show how the optimization is done.
Loop analysis – scalar evolution
There are many cases where compilers need to track how values are updated within a loop. For example, the loop vectorizer needs to check that the pointers are moved to the adjacent element in the next iteration, and check that no other pointer indexing may alias the range we are vectorizing.
Both GCC and LLVM does this in the same way in their scalar evolution passes, where each variable at iteration \(i\) (we start enumerating iterations from \(0\)) is represented as a function \(f_0(i)\) defined as a linear recurrence of the form
\[f_j(i) =
\begin{cases}
\phi_j & \text{if $i = 0$} \\
f_j(i-1) \odot_{j+1} f_{j+1}(i-1) & \text{if $i > 0$}
\end{cases}\] where \(\odot\in\{+,*\}\).
\begin{cases}
0 & \text{if $i = 0$} \\
f(i-1) + 1 & \text{if $i > 0$}
\end{cases}\]
\begin{cases}
2\phantom{f_0(i-1) + f_1(i-1)} & \text{if $i = 0$} \\
f_2(i-1) + 6 & \text{if $i > 0$}
\end{cases}\\
f_1(i) & =
\begin{cases}
k-1 & \text{if $i = 0$} \\
f_1(i-1) + f_2(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases}\\
f(i) = f_0(i) & =
\begin{cases}
7 & \text{if $i = 0$} \\
f_0(i-1) + f_1(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases}\end{align}\] One optimization we can see directly from these functions is that the value can be calculated by just three additions within the loop
\begin{cases}
2\phantom{f_0(i-1) + f_1(i-1)} & \text{if $i = 0$} \\
f_2(i-1) + 6 & \text{if $i > 0$}
\end{cases} \phantom{xx}\text{is written as $\{2,+,6\}$}\\
f_1(i) & =
\begin{cases}
k-1 & \text{if $i = 0$} \\
f_1(i-1) + f_2(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases} \phantom{xx}\text{is written as $\{k-1,+,f_2\}$}\\
f(i) = f_0(i) & =
\begin{cases}
7 & \text{if $i = 0$} \\
f_0(i-1) + f_1(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases} \phantom{xx}\text{is written as $\{7,+,f_1\}$}\end{align}\] These can be chained, so \(f(i)\) can be written as a chain of recurrences (CR) \(\{7,+,\{k-1,+,\{2,+,6\}\}\}\). The inner curly braces are redundant, so the CR is usually written as a tuple \(\{7,+,k-1,+,2,+,6\}\).
\{\phi_0, +, \phi_1\} + \{\psi_0, +, \psi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0 + \psi_0, +, \phi_1 + \psi_1\} \\
\{\phi_0, +, \phi_1\}* \{\psi_0, +, \psi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0 * \psi_0, +, \psi_1 * \{\phi_0, +, \phi_1\} + \phi_1 * \{\psi_0, +, \psi_1\} + \phi_1*\psi_1\} \\
\{\phi_0, +, \phi_1,+,0\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0, +, \phi_1\}\end{align}\] So for the loop from the
& = \{0 * 0, +, 1 * \{0, +,1\} + 1 * \{0, +, 1\} + 1*1\} \\
& = \{0, +, 1,+,2\}\end{align}\] Similar calculations for
\[f_j(i) =
\begin{cases}
\phi_j & \text{if $i = 0$} \\
f_j(i-1) \odot_{j+1} f_{j+1}(i-1) & \text{if $i > 0$}
\end{cases}\] where \(\odot\in\{+,*\}\).
Example 1
The simplest case is a loop such asvoid foo(int m, int *p) { for (int j = 0; j < m; j++) *p++ = j; }The loop writes \(0\) to
*p++
in the first iteration, \(1\) in the second, etc. So we can express the value written at iteration \(i\) as\[f(i) =\begin{cases}
0 & \text{if $i = 0$} \\
f(i-1) + 1 & \text{if $i > 0$}
\end{cases}\]
Example 2
Polynomials in the iteration variable can also be expressed in this form.void foo(int m, int k, int *p) { for (int j = 0; < m; j++) *p++ = j*j*j - 2*j*j + k*j + 7; }We will see below how to build the functions, but the result for the value stored in this loop is \[\begin{align}f_2(i) & =
\begin{cases}
2\phantom{f_0(i-1) + f_1(i-1)} & \text{if $i = 0$} \\
f_2(i-1) + 6 & \text{if $i > 0$}
\end{cases}\\
f_1(i) & =
\begin{cases}
k-1 & \text{if $i = 0$} \\
f_1(i-1) + f_2(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases}\\
f(i) = f_0(i) & =
\begin{cases}
7 & \text{if $i = 0$} \\
f_0(i-1) + f_1(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases}\end{align}\] One optimization we can see directly from these functions is that the value can be calculated by just three additions within the loop
void foo(int m, int k, int *p) { int t0 = 7; int t1 = k-1; int t2 = 2; for (int j = 0; j < m; j++) { *p++ = t0; t0 = t0 + t1; t1 = t1 + t2; t2 = t2 + 6; } }which is a useful optimization for architectures were multiplication is expensive. This kind of code is, however, uncommon, so most compilers do not do this optimzation – but they usually do this for simpler cases, such as
void foo(int m, int k, int *p) { for (int j = 0; < m; j++) *p++ = k*j + 7; }as constructs of the form
k*j+7
are common in address calculations.Chains of recurrences
It is cumbersome to write the recursive functions all the time, so the functions are usually written in the form \(\{\phi_j, \odot_{j+1}, f_{j+1}\}\). For example \[\begin{align}f_2(i) & =\begin{cases}
2\phantom{f_0(i-1) + f_1(i-1)} & \text{if $i = 0$} \\
f_2(i-1) + 6 & \text{if $i > 0$}
\end{cases} \phantom{xx}\text{is written as $\{2,+,6\}$}\\
f_1(i) & =
\begin{cases}
k-1 & \text{if $i = 0$} \\
f_1(i-1) + f_2(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases} \phantom{xx}\text{is written as $\{k-1,+,f_2\}$}\\
f(i) = f_0(i) & =
\begin{cases}
7 & \text{if $i = 0$} \\
f_0(i-1) + f_1(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases} \phantom{xx}\text{is written as $\{7,+,f_1\}$}\end{align}\] These can be chained, so \(f(i)\) can be written as a chain of recurrences (CR) \(\{7,+,\{k-1,+,\{2,+,6\}\}\}\). The inner curly braces are redundant, so the CR is usually written as a tuple \(\{7,+,k-1,+,2,+,6\}\).
Building the chains of recurrences
The chains of recurrences are built by iterating over the code and calculating the CR for the result of each operation (or marking it as unknown if we cannot handle the operation), using simplification rules such as\[\begin{align}c * \{\phi_0, +, \phi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{c * \phi_0, +, c * \phi_1\} \\\{\phi_0, +, \phi_1\} + \{\psi_0, +, \psi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0 + \psi_0, +, \phi_1 + \psi_1\} \\
\{\phi_0, +, \phi_1\}* \{\psi_0, +, \psi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0 * \psi_0, +, \psi_1 * \{\phi_0, +, \phi_1\} + \phi_1 * \{\psi_0, +, \psi_1\} + \phi_1*\psi_1\} \\
\{\phi_0, +, \phi_1,+,0\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0, +, \phi_1\}\end{align}\] So for the loop from the
sum
functionfor (int j = 0; j < count; ++j) result += j*j;we start with
j
which we know has the CR \(\{0,+,1\}\) per Example 1. This is then used as j*j
when calculating result
, so we calculate the CR for j*j
using the simplification rules as \[\begin{align}j*j& = \{0,+,1\} * \{0,+,1\} \\& = \{0 * 0, +, 1 * \{0, +,1\} + 1 * \{0, +, 1\} + 1*1\} \\
& = \{0, +, 1,+,2\}\end{align}\] Similar calculations for
result
gives us the CR \(\{0,+,0,+,1,+,2\}\) for the value at the beginning of the iteration, and \(\{0,+,1,+,3,+,2\}\) after adding j*j
.Doing the optimization
The optimization is done during induction variable simplification, and LLVM has transformed the function to a form more convenient for analysis and optimization
One nice property of the chains of recurrences is that it is easy to calculate the value at a specific iteration – if we have a CR \(\{\phi_0,+,\phi_1,+,\ldots,+,\phi_n\}\), then the value at iteration \(i\) can be calculated as \[\begin{align}f(i) & = \sum_{j=0}^{n}\phi_j{i \choose j} \\ & = \phi_0 + \phi_1i + \phi_2{i(i-1)\over 2!} + \ldots + \phi_n{i(i-1)\cdots(i-n+1)\over n!}\end{align}\] Inserting the values for the CR \(\{0,+,1,+,3,+,2\}\) describing
One problem with this optimization is that it is hard for developers to make the compiler generate a loop for this if they know that the majority of values used in reality are small enough for the loop to be the fastest option. GCC does, therefore, not replace the final value of a loop if the expression is expensive
int sum(int count) { int result = 0; if (count > 0) { int j = 0; do { result = result + j*j; ++j; } while (j < count); } return result; }or as it looks like in the LLVM IR
define i32 @sum(i32) { %2 = icmp sgt i32 %0, 0 br i1 %2, label %3, label %6 ; <label>:3: br label %8 ; <label>:4: %5 = phi i32 [ %12, %8 ] br label %6 ; <label>:6: %7 = phi i32 [ 0, %1 ], [ %5, %4 ] ret i32 %7 ; <label>:8: %9 = phi i32 [ %13, %8 ], [ 0, %3 ] ; {0,+,1} %10 = phi i32 [ %12, %8 ], [ 0, %3 ] ; {0,+,0,+,1,+,2} %11 = mul nsw i32 %9, %9 ; {0,+,1,+,2} %12 = add nuw nsw i32 %11, %10 ; {0,+,1,+,3,+,2} %13 = add nuw nsw i32 %9, 1 ; {1,+,1} %14 = icmp slt i32 %13, %0 br i1 %14, label %8, label %4 }The compiler can see that the function returns
0
if count <= 0
, otherwise it returns the value of result
at loop iteration count-1
.One nice property of the chains of recurrences is that it is easy to calculate the value at a specific iteration – if we have a CR \(\{\phi_0,+,\phi_1,+,\ldots,+,\phi_n\}\), then the value at iteration \(i\) can be calculated as \[\begin{align}f(i) & = \sum_{j=0}^{n}\phi_j{i \choose j} \\ & = \phi_0 + \phi_1i + \phi_2{i(i-1)\over 2!} + \ldots + \phi_n{i(i-1)\cdots(i-n+1)\over n!}\end{align}\] Inserting the values for the CR \(\{0,+,1,+,3,+,2\}\) describing
result
gives us \[f(i) = i + {3i(i-1)\over 2} + {i(i-1)(i-2) \over 3}\] The compiler now only need to insert code that calculates this with \(i =\) count-1
, after the loopresult = count-1 + 3*(count-1)*(count-2)/2 + (count-1)*(count-2)(count-3)/3;but it need to take some care to calculate in the correct precision (as temporary values may not fit in 32-bit integers). Integer division is slow, so it is also doing tricks to replace the divisions with multiplication and shift instructions. The result is the LLVM IR
%4 = add i32 %0, -1 %5 = zext i32 %4 to i33 %6 = add i32 %0, -2 %7 = zext i32 %6 to i33 %8 = mul i33 %5, %7 %9 = add i32 %0, -3 %10 = zext i32 %9 to i33 %11 = mul i33 %8, %10 %12 = lshr i33 %11, 1 %13 = trunc i33 %12 to i32 %14 = mul i32 %13, 1431655766 %15 = add i32 %14, %0 %16 = lshr i33 %8, 1 %17 = trunc i33 %16 to i32 %18 = mul i32 %17, 3 %19 = add i32 %15, %18 %20 = add i32 %19, -1Inserting this makes the loop become dead, so it is later removed by dead code elimination, and we eventually end up with the code
sum(int): test edi, edi jle .LBB0_1 lea eax, [rdi - 1] lea ecx, [rdi - 2] imul rcx, rax lea eax, [rdi - 3] imul rax, rcx shr rax imul eax, eax, 1431655766 add eax, edi shr rcx lea ecx, [rcx + 2*rcx] lea eax, [rax + rcx] add eax, -1 ret .LBB0_1: xor eax, eax ret
Performance
This optimization is not always profitable. For example,int sum(int count) { int result = 0; for (int j = 0; j < count; ++j) result += j*j*j*j*j*j; return result; }is calculated by three 32-bit multiplications and one addition within the loop, while the optimized version needs six 64-bit multiplications, five 32-bit multiplications, and a slew of other instructions (godbolt), so the optimized version is slower for small values of
count
.
I benchmarked on my PC, and count
must be larger than 5
for the optimized version to be faster than the loop. Smaller CPUs with, for example, more expensive 64-bit multiplication, will need a higher count
for the optimization to help. And CPUs not having instructions for 64-bit multiplication (godbolt) will need a much higher count
.One problem with this optimization is that it is hard for developers to make the compiler generate a loop for this if they know that the majority of values used in reality are small enough for the loop to be the fastest option. GCC does, therefore, not replace the final value of a loop if the expression is expensive
/* Do not emit expensive expressions. The rationale is that when someone writes a code like while (n > 45) n -= 45; he probably knows that n is not large, and does not want it to be turned into n %= 45. */ || expression_expensive_p (def))So GCC not doing this optimization is a feature – not a bug.
Further readings
Chains of recurrences:- Olaf Bachmann, Paul S. Wang, Eugene V. Zima. “Chains of recurrences – a method to expedite the evaluation of closed-form functions”
- Eugene V. Zima. “On computational properties of chains of recurrences”
- Robert A. van Engelen. “Symbolic Evaluation of Chains of Recurrences for Loop Optimization”
- Robert A. van Engelen. “Efficient Symbolic Analysis for Optimizing Compilers”
- Torbjörn Granlund, Peter L. Montgomery. “Division by Invariant Integers using Multiplication”
Updated: The original post incorrectly called the power sums “geometric sums”.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.