Krister Walfridsson’s old blog: LLVM

LLVM optimizes power sums, such as

int sum(int count)
{
  int result = 0;

  for (int j = 0; j < count; ++j)
    result += j*j;

  return result;
}

to code that calculates the result without a loop (godbolt)

sum(int):
        test    edi, edi
        jle     .LBB0_1
        lea     eax, [rdi - 1]
        lea     ecx, [rdi - 2]
        imul    rcx, rax
        lea     eax, [rdi - 3]
        imul    rax, rcx
        shr     rax
        imul    eax, eax, 1431655766
        add     eax, edi
        shr     rcx
        lea     ecx, [rcx + 2*rcx]
        lea     eax, [rax + rcx]
        add     eax, -1
        ret
.LBB0_1:
        xor     eax, eax
        ret

It handles more complex cases too (godbolt) – that is, the optimization is not just a simple pattern matcher. This post will show how the optimization is done.

Loop analysis – scalar evolution

There are many cases where compilers need to track how values are updated within a loop. For example, the loop vectorizer needs to check that the pointers are moved to the adjacent element in the next iteration, and check that no other pointer indexing may alias the range we are vectorizing.

Both GCC and LLVM does this in the same way in their scalar evolution passes, where each variable at iteration $i$ (we start enumerating iterations from $0$) is represented as a function $f_0(i)$ defined as a linear recurrence of the form
\[f_j(i) =
\begin{cases}
\phi_j & \text{if $i = 0$} \\
f_j(i-1) \odot_{j+1} f_{j+1}(i-1) & \text{if $i > 0$}
\end{cases}\] where $\odot\in\{+,*\}$.

Example 1

The simplest case is a loop such as

void foo(int m, int *p)
{
  for (int j = 0; j < m; j++)
    *p++ = j;
}

The loop writes $0$ to *p++ in the first iteration, $1$ in the second, etc. So we can express the value written at iteration $i$ as\[f(i) =
\begin{cases}
0 & \text{if $i = 0$} \\
f(i-1) + 1 & \text{if $i > 0$}
\end{cases}\]

Example 2

Polynomials in the iteration variable can also be expressed in this form.

void foo(int m, int k, int *p)
{
  for (int j = 0; < m; j++)
    *p++ = j*j*j - 2*j*j + k*j + 7;
}

We will see below how to build the functions, but the result for the value stored in this loop is \[\begin{align}f_2(i) & =
\begin{cases}
2\phantom{f_0(i-1) + f_1(i-1)} & \text{if $i = 0$} \\
f_2(i-1) + 6 & \text{if $i > 0$}
\end{cases}\\
f_1(i) & =
\begin{cases}
k-1 & \text{if $i = 0$} \\
f_1(i-1) + f_2(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases}\\
f(i) = f_0(i) & =
\begin{cases}
7 & \text{if $i = 0$} \\
f_0(i-1) + f_1(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases}\end{align}\] One optimization we can see directly from these functions is that the value can be calculated by just three additions within the loop

void foo(int m, int k, int *p)
{
  int t0 = 7;
  int t1 = k-1;
  int t2 = 2;
  for (int j = 0; j < m; j++) {
    *p++ = t0;
    t0 = t0 + t1;
    t1 = t1 + t2;
    t2 = t2 + 6;
  }
}

which is a useful optimization for architectures were multiplication is expensive. This kind of code is, however, uncommon, so most compilers do not do this optimzation – but they usually do this for simpler cases, such as

void foo(int m, int k, int *p)
{
  for (int j = 0; < m; j++)
    *p++ = k*j + 7;
}

as constructs of the form k*j+7 are common in address calculations.

Chains of recurrences

It is cumbersome to write the recursive functions all the time, so the functions are usually written in the form $\{\phi_j, \odot_{j+1}, f_{j+1}\}$. For example \[\begin{align}f_2(i) & =
\begin{cases}
2\phantom{f_0(i-1) + f_1(i-1)} & \text{if $i = 0$} \\
f_2(i-1) + 6 & \text{if $i > 0$}
\end{cases} \phantom{xx}\text{is written as $\{2,+,6\}$}\\
f_1(i) & =
\begin{cases}
k-1 & \text{if $i = 0$} \\
f_1(i-1) + f_2(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases} \phantom{xx}\text{is written as $\{k-1,+,f_2\}$}\\
f(i) = f_0(i) & =
\begin{cases}
7 & \text{if $i = 0$} \\
f_0(i-1) + f_1(i-1)\phantom{2} & \text{if $i > 0$}
\end{cases} \phantom{xx}\text{is written as $\{7,+,f_1\}$}\end{align}\] These can be chained, so $f(i)$ can be written as a chain of recurrences (CR) $\{7,+,\{k-1,+,\{2,+,6\}\}\}$. The inner curly braces are redundant, so the CR is usually written as a tuple $\{7,+,k-1,+,2,+,6\}$.

Building the chains of recurrences

The chains of recurrences are built by iterating over the code and calculating the CR for the result of each operation (or marking it as unknown if we cannot handle the operation), using simplification rules such as\[\begin{align}c * \{\phi_0, +, \phi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{c * \phi_0, +, c * \phi_1\} \\
\{\phi_0, +, \phi_1\} + \{\psi_0, +, \psi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0 + \psi_0, +, \phi_1 + \psi_1\} \\
\{\phi_0, +, \phi_1\}* \{\psi_0, +, \psi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0 * \psi_0, +, \psi_1 * \{\phi_0, +, \phi_1\} + \phi_1 * \{\psi_0, +, \psi_1\} + \phi_1*\psi_1\} \\
\{\phi_0, +, \phi_1,+,0\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0, +, \phi_1\}\end{align}\] So for the loop from the sum function

for (int j = 0; j < count; ++j)
  result += j*j;

we start with j which we know has the CR $\{0,+,1\}$ per Example 1. This is then used as j*j when calculating result, so we calculate the CR for j*j using the simplification rules as \[\begin{align}j*j& = \{0,+,1\} * \{0,+,1\} \\
& = \{0 * 0, +, 1 * \{0, +,1\} + 1 * \{0, +, 1\} + 1*1\} \\
& = \{0, +, 1,+,2\}\end{align}\] Similar calculations for result gives us the CR $\{0,+,0,+,1,+,2\}$ for the value at the beginning of the iteration, and $\{0,+,1,+,3,+,2\}$ after adding j*j.

Doing the optimization

The optimization is done during induction variable simplification, and LLVM has transformed the function to a form more convenient for analysis and optimization

int sum(int count)
{
  int result = 0;

  if (count > 0) {
    int j = 0;
    do {
      result = result + j*j;
      ++j;
    } while (j < count);
  }

  return result;
}

or as it looks like in the LLVM IR

define i32 @sum(i32) {
  %2 = icmp sgt i32 %0, 0
  br i1 %2, label %3, label %6

; <label>:3:
  br label %8

; <label>:4:
  %5 = phi i32 [ %12, %8 ]
  br label %6

; <label>:6:
  %7 = phi i32 [ 0, %1 ], [ %5, %4 ]
  ret i32 %7

; <label>:8:
  %9 = phi i32 [ %13, %8 ], [ 0, %3 ]     ; {0,+,1}
  %10 = phi i32 [ %12, %8 ], [ 0, %3 ]    ; {0,+,0,+,1,+,2}
  %11 = mul nsw i32 %9, %9                ; {0,+,1,+,2}
  %12 = add nuw nsw i32 %11, %10          ; {0,+,1,+,3,+,2}
  %13 = add nuw nsw i32 %9, 1             ; {1,+,1}
  %14 = icmp slt i32 %13, %0
  br i1 %14, label %8, label %4
}

The compiler can see that the function returns 0 if count <= 0, otherwise it returns the value of result at loop iteration count-1.

One nice property of the chains of recurrences is that it is easy to calculate the value at a specific iteration – if we have a CR $\{\phi_0,+,\phi_1,+,\ldots,+,\phi_n\}$, then the value at iteration $i$ can be calculated as \[\begin{align}f(i) & = \sum_{j=0}^{n}\phi_j{i \choose j} \\ & = \phi_0 + \phi_1i + \phi_2{i(i-1)\over 2!} + \ldots + \phi_n{i(i-1)\cdots(i-n+1)\over n!}\end{align}\] Inserting the values for the CR $\{0,+,1,+,3,+,2\}$ describing result gives us \[f(i) = i + {3i(i-1)\over 2} + {i(i-1)(i-2) \over 3}\] The compiler now only need to insert code that calculates this with $i =$ count-1, after the loop

result = count-1 + 3*(count-1)*(count-2)/2 + (count-1)*(count-2)(count-3)/3;

but it need to take some care to calculate in the correct precision (as temporary values may not fit in 32-bit integers). Integer division is slow, so it is also doing tricks to replace the divisions with multiplication and shift instructions. The result is the LLVM IR

  %4 = add i32 %0, -1
  %5 = zext i32 %4 to i33
  %6 = add i32 %0, -2
  %7 = zext i32 %6 to i33
  %8 = mul i33 %5, %7
  %9 = add i32 %0, -3
  %10 = zext i32 %9 to i33
  %11 = mul i33 %8, %10
  %12 = lshr i33 %11, 1
  %13 = trunc i33 %12 to i32
  %14 = mul i32 %13, 1431655766
  %15 = add i32 %14, %0
  %16 = lshr i33 %8, 1
  %17 = trunc i33 %16 to i32
  %18 = mul i32 %17, 3
  %19 = add i32 %15, %18
  %20 = add i32 %19, -1

Inserting this makes the loop become dead, so it is later removed by dead code elimination, and we eventually end up with the code

sum(int):
        test    edi, edi
        jle     .LBB0_1
        lea     eax, [rdi - 1]
        lea     ecx, [rdi - 2]
        imul    rcx, rax
        lea     eax, [rdi - 3]
        imul    rax, rcx
        shr     rax
        imul    eax, eax, 1431655766
        add     eax, edi
        shr     rcx
        lea     ecx, [rcx + 2*rcx]
        lea     eax, [rax + rcx]
        add     eax, -1
        ret
.LBB0_1:
        xor     eax, eax
        ret

Performance

This optimization is not always profitable. For example,

int sum(int count)
{
  int result = 0;

  for (int j = 0; j < count; ++j)
    result += j*j*j*j*j*j;

  return result;
}

is calculated by three 32-bit multiplications and one addition within the loop, while the optimized version needs six 64-bit multiplications, five 32-bit multiplications, and a slew of other instructions (godbolt), so the optimized version is slower for small values of count. I benchmarked on my PC, and count must be larger than 5 for the optimized version to be faster than the loop. Smaller CPUs with, for example, more expensive 64-bit multiplication, will need a higher count for the optimization to help. And CPUs not having instructions for 64-bit multiplication (godbolt) will need a much higher count.

One problem with this optimization is that it is hard for developers to make the compiler generate a loop for this if they know that the majority of values used in reality are small enough for the loop to be the fastest option. GCC does, therefore, not replace the final value of a loop if the expression is expensive

/* Do not emit expensive expressions.  The rationale is that
   when someone writes a code like

   while (n > 45) n -= 45;

   he probably knows that n is not large, and does not want it
   to be turned into n %= 45.  */
|| expression_expensive_p (def))

So GCC not doing this optimization is a feature – not a bug.

Further readings

Chains of recurrences:

Olaf Bachmann, Paul S. Wang, Eugene V. Zima. “Chains of recurrences – a method to expedite the evaluation of closed-form functions”
Eugene V. Zima. “On computational properties of chains of recurrences”

Loop optimizations using chains of recurrences:

Robert A. van Engelen. “Symbolic Evaluation of Chains of Recurrences for Loop Optimization”
Robert A. van Engelen. “Efficient Symbolic Analysis for Optimizing Compilers”

Optimizing division using multiplication and shift instructions:

Torbjörn Granlund, Peter L. Montgomery. “Division by Invariant Integers using Multiplication”

Updated: The original post incorrectly called the power sums “geometric sums”.

I saw on twitter that it takes more than twice as much time compiling the Ogre graphics engine using GCC than when using Clang. My experience is that GCC and Clang usually compile with similar speed, so I decided to look into why compiling Ogre is different.

It turned out that a big part of the difference comes from which C++ version the compilers use per default – clang-5.0 defaults to C++98, while gcc-7 defaults to a newer version. Forcing the compilers to use C++98 by passing -std=c++98 makes GCC compile Ogre in about half the time (668s vs. 1135s), while passing -std=c++11 nearly doubles the time Clang needs to compile it!

One reason for this difference is that some of the standard include files are more expensive in C++11 mode as they suck in more dependencies. For example, compiling a file containing just the line

#include <memory>

takes 0.16 seconds on my computer when using C++11

> time -p g++ -O2 -c test.cpp -std=c++11
real 0.16
user 0.14
sys 0.01

while compiling it as C++98 is faster

> time -p g++ -O2 -c test.cpp -std=c++98
real 0.02
user 0.01
sys 0.01

The 0.14-second difference may not seem that big, but it makes a difference when, as for Ogre, you are compiling more than 500 files, each taking about one second. The increased cost of including standard header files for C++11 compared to C++98 adds about 20% to the Ogre build time.

It is a bit unclear to me exactly where the rest of the slowdown comes from, but it seems to be spread all over the code (I tried to remove various classes in the Ogre code base, and removing 10% of the source code seems to affect both the fast and slow version by about 10%) so I assume this is just because the templates in the C++11 STL are more complex and the compiler needs to work a bit harder each time they are used...

Anyway, the difference in compilation time between -std=c++98 and -std=c++11 was much bigger than I had guessed, and I’ll now ensure I use -std=c++98 when building C++98 code.

Updated: The original blog post said that gcc-7 uses C++11 per default. That was wrong, it defaults to C++14.

Krister Walfridsson’s old blog

Sunday, April 28, 2019

How LLVM optimizes power sums

Loop analysis – scalar evolution

Example 1

Example 2

Chains of recurrences

Building the chains of recurrences

Doing the optimization

Performance

Further readings

Thursday, April 12, 2018

Compilation time – -std=c++98 vs. -std=c++11

Friday, September 8, 2017

Follow-up on “Why undefined behavior may call a never-called function”