## Tuesday, May 30, 2017

### Interprocedural optimization in GCC

Compilers can do a better job optimizing a function if they can use knowledge of other functions. The obvious case is inlining, but there are many more cases. This post lists the interprocedural optimizations implemented in GCC 7.

Many of the optimizations are only relevant for large functions (small functions are inlined into their callers!) or for helping other optimization passes. This makes it hard to give relevant examples, so the examples in this post merely illustrate the principles.

### Parameter passing

Parameter passing for functions where GCC can see all callers (such as functions that are local to a translation unit, or when the whole program is compiled using link-time optimization) is optimized as follows:
• Unused parameters are removed.
• Parameters passed by reference may be changed to be passed by value. For example,
```c
static int foo(int *m)
{
    return *m + 1;
}

int bar(void)
{
    int i = 1;
    return foo(&i);
}
```

is changed to
```c
static int foo(int m)
{
    return m + 1;
}

int bar(void)
{
    int i = 1;
    return foo(i);
}
```

which makes it much easier for other optimization passes to reason about the variables.
• A structure may be split into its elements. For example,
```c
struct bovid
{
    float red;
    int green;
    void *blue;
};

static void ox(struct bovid *cow)
{
    cow->red = cow->red + cow->green;
}

int main(void)
{
    struct bovid cow;

    cow.red = 7.4;
    cow.green = 6;
    cow.blue = &cow;

    ox(&cow);

    return 0;
}
```

is changed to
```c
struct bovid
{
    float red;
    int green;
    void *blue;
};

static void ox(float *t1, int t2)
{
    *t1 = *t1 + t2;
}

int main(void)
{
    struct bovid cow;

    cow.red = 7.4;
    cow.green = 6;
    cow.blue = &cow;

    ox(&cow.red, cow.green);

    return 0;
}
```

These optimizations are enabled by -fipa-sra, which is enabled by default at -Os, -O2, and -O3.

### Constant propagation

Functions where all callers pass the same constant can be optimized by propagating the constant into the function. That is,
```c
static int foo(int a, int b)
{
    if (b > 0)
        return a + b;
    else
        return a * b;
}

int bar(int m, int n)
{
    return foo(m, 7) + foo(n, 7);
}
```

is optimized to
```c
static int foo(int a)
{
    return a + 7;
}

int bar(int m, int n)
{
    return foo(m) + foo(n);
}
```


The constants can be propagated bitwise, which is useful for flag parameters. For example
```c
static int foo(int a, int b)
{
    if (b & 4)
        return a & (b & 1);
    else
        return a & (b & 2);
}

int bar(int m, int n)
{
    return foo(m, 9) | foo(n, 3);
}
```

is optimized to
```c
static int foo(int a, int b)
{
    return a & (b & 2);
}

int bar(int m, int n)
{
    return foo(m, 9) | foo(n, 3);
}
```


The constants do not need to be the same in all function calls – GCC tracks ranges of possible values and optimizes as appropriate, so
```c
static int foo(int a, int b)
{
    if (b > 0)
        return a + b;
    else
        return a * b;
}

int bar(int m, int n)
{
    return foo(m, 5) + foo(n, 7);
}
```

is optimized to
```c
static int foo(int a, int b)
{
    return a + b;
}

int bar(int m, int n)
{
    return foo(m, 5) + foo(n, 7);
}
```

as both 5 and 7 are greater than 0.

These optimizations are enabled by -fipa-cp, -fipa-bit-cp, and -fipa-vrp, which are enabled by default at -Os, -O2, and -O3.

### Constant propagation – cloning

It is often the case that only a few of the function calls pass constants as parameters, or that the constants are conflicting so they cannot be propagated into the called function. GCC handles this by cloning the called function to let each conflicting call get its own version. For example,
```c
static int foo(int a, int b)
{
    if (b > 0)
        return a + b;
    else
        return a * b;
}

int bar(int m, int n)
{
    return foo(m, 5) + foo(m, n);
}
```

creates one clone of foo and optimizes it using the constant 5 for the parameter b
```c
static int foo(int a, int b)
{
    if (b > 0)
        return a + b;
    else
        return a * b;
}

static int foo_clone(int a)
{
    return a + 5;
}

int bar(int m, int n)
{
    return foo_clone(m) + foo(m, n);
}
```


This optimization is enabled by -fipa-cp-clone, which is enabled by default at -O3.

### Devirtualization

Devirtualization (converting calls to virtual functions to direct calls – see Jan Hubička's blog series on how devirtualization works in GCC) is helped by propagating type information in roughly the same way as the constants are propagated, and is implemented by the constant propagation pass.

This is enabled by -fipa-cp and -fdevirtualize, which are enabled by default at -Os, -O2, and -O3.

### Caller-saved registers

Values kept in caller-saved registers do not need to be saved across a call if those registers are not used by the called function.

This optimization is enabled by -fipa-ra, which is enabled by default at -Os, -O2, and -O3.

### Identical code folding

The “identical code folding” pass merges identical functions. The functions do not need to be identical in the source code – the merging is done halfway through the optimization pipeline, so it is enough that they have the same structure after simplification (variable names etc. do not matter).

Functions that may be used outside the compilation unit cannot be completely merged as the C and C++ standards require that functions have unique addresses. GCC solves this by adding wrappers for the exported symbols, so that
```c
#include <stdio.h>

void foo(char *s)
{
    printf("Hello %s\n", s);
}

void bar(char *s)
{
    printf("Hello %s\n", s);
}
```

is generated as
```asm
.LC0:
        .string "Hello %s\n"

foo:
        mov     rsi, rdi
        xor     eax, eax
        mov     edi, OFFSET FLAT:.LC0
        jmp     printf

bar:
        jmp     foo
```

This optimization is enabled by -fipa-icf, which is enabled by default at -Os, -O2, and -O3.

### Profile propagation

Many optimizations have different heuristics depending on how much the code is executed. The compiler estimates branch frequencies and propagates this information between functions so that, for example, a function only called from “cold” code segments is treated as a “cold” function.

This is enabled by -fipa-profile, which is enabled by default at -O and higher.

### Pure, const, and noexcept

GCC analyzes functions to determine if they access memory or may throw exceptions, propagates this information throughout the compilation unit, and annotates the functions with pure, const, and noexcept attributes when possible, which helps other optimizations.

This optimization is enabled by -fipa-pure-const, which is enabled by default at -O and higher.

### Global variables

It is in general hard to optimize the use of global variables, but it is easy to improve the use of global variables that cannot escape the compilation unit and that do not have their address taken. Three optimizations are done on such variables:
• Removal of global variables that are never read.
• A global variable that is used in only one function may be changed to a local variable in that function.
• The compiler tracks which functions modify the variables, so that loads and stores may be moved over function calls that do not touch the variable. For example, the function bar in
```c
static int g;

void foo(void)
{
    // Code not touching g
}

void bar(void)
{
    g += 1;
    foo();
    g += 2;
}
```

is optimized to
```c
void bar(void)
{
    foo();
    g += 3;
}
```

These optimizations are enabled by -fipa-reference, which is enabled by default at -O and higher.

### Pointer analysis

GCC can do interprocedural pointer analysis, which is enabled by -fipa-pta. This optimization is not enabled by default at any optimization level as it can cause excessive memory and compile-time usage on large compilation units.

## Sunday, May 21, 2017

### Seeding the std::mt19937 random number engine

A comment on Hacker News complained that the code in my previous blog post does not seed the std::mt19937 random number engine properly. The code was taken directly from a CppCon presentation, so I don’t want to take the blame, but the comment is right — the initialization code can be improved.

### State size and seeding

The initialization in the blog post was done as
```cpp
std::random_device rd;
std::mt19937 gen(rd());
```

which seeds the std::mt19937 random number engine with a random 32-bit value. The problem with this is that the Mersenne twister has 19968 bits of internal state, so it can generate $$2^{19968}$$ different streams of random values, but we can only reach $$2^{32}$$ of them when initializing with a 32-bit value.

This is not necessarily a problem. Let’s say the random numbers are used for generating input data in unit tests. The test suite is probably not run more than a few thousand times, so it does not matter that it only can create $$2^{32}$$ different test runs. But there are use-cases where this is a problem.

The random number engine can be seeded with more data by using std::seed_seq, and the code below seeds the std::mt19937 with the same number of bits as are in the state
```cpp
std::random_device rd;
std::array<int, std::mt19937::state_size> seed_data;
std::generate_n(seed_data.data(), seed_data.size(), std::ref(rd));
std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
std::mt19937 gen(seq);
```


### std::random_device

One other potential problem is the quality of the seed values. The idea behind std::random_device is that it returns non-deterministic random numbers, but it is allowed to return deterministic values (e.g. if a non-deterministic source is not available to the implementation). I’m not a big fan of this functionality — it either does exactly what you want (generates non-deterministic values) or it does the opposite (generates deterministic values), and there is no way you can determine which.1

This is probably not a problem when developing for the big platforms, but there may be surprises when running the code in other environments — at least old versions of libstdc++ on MinGW always return the same sequence of values...

1. The std::random_device can return an estimate of the entropy, and it is required to return 0 if the values are generated deterministically. But it is not required to return non-zero for the non-deterministic case, and e.g. libstdc++ is conservative and always estimates the entropy as 0, even when /dev/urandom or the x86 RDRAND instruction is used.

## Tuesday, May 16, 2017

### Integer division is slow

The CppCon 2016 talk “I Just Wanted a Random Integer!” benchmarks randomization functionality from the C++ standard library (using GCC 5.1). There is one surprising result — the loop
```cpp
std::random_device entropySource;
std::mt19937 randGenerator(entropySource());
std::uniform_int_distribution<int> theIntDist(0, 99);

for (int i = 0; i < 1000000000; i++) {
    volatile auto r = theIntDist(randGenerator);
}
```

needs 23.4 seconds to run, while
```cpp
std::random_device entropySource;
std::mt19937 randGenerator(entropySource());

for (int i = 0; i < 1000000000; i++) {
    std::uniform_int_distribution<int> theIntDist(0, 99);
    volatile auto r = theIntDist(randGenerator);
}
```

runs in 5.1 seconds. But the latter should intuitively be slower, as it does more work in the loop...

### Code expansion

The functionality in the standard library is implemented using template magic, but the compiler’s view of the code after inlining and basic simplification is that
```cpp
std::uniform_int_distribution<int> theIntDist(0, 99);
```

is just defining and initializing a structure
```cpp
struct {
    int a, b;
} theIntDist;
theIntDist.a = 0;
theIntDist.b = 99;
```

while the call
```cpp
volatile auto r = theIntDist(randGenerator);
```

is expanded to the equivalent of
```cpp
uint64_t ret;
uint64_t urange = theIntDist.b - theIntDist.a;
if (0xffffffff > urange) {
    const uint64_t uerange = urange + 1;
    const uint64_t scaling = 0xffffffff / uerange;
    const uint64_t past = uerange * scaling;
    do {
        ret = mersenne_twister_engine(randGenerator);
    } while (ret >= past);
    ret /= scaling;
} else {
    ...
    uniform_int_distribution(&theIntDist, randGenerator);
    ...
    ret = ...
}
volatile int r = ret + theIntDist.a;
```

where I have used ... for code that is not relevant for the rest of the discussion.

### Optimization differences

It is now easy to see why the second case is faster — creating theIntDist in the loop makes it trivial for the compiler to determine that urange has the value 99, and the code simplifies to
```cpp
uint64_t ret;
do {
    ret = mersenne_twister_engine(randGenerator);
} while (ret >= 4294967200);
ret /= 42949672;
volatile int r = ret;
```

This simplification is not possible when theIntDist is created outside of the loop — the compiler sees that the loop calls uniform_int_distribution with a reference to theIntDist, so it must assume that the values of theIntDist.a and theIntDist.b may change during execution, and it can therefore not do the constant folding. The function does not, however, modify theIntDist, so both versions of the program do the same work, but the slow version needs to do one extra comparison/branch and a few extra arithmetic instructions for each loop iteration.

### The cost of division

The mersenne_twister_engine is not a big function, but it is not trivial — it executes about 40 instructions — so it is surprising that adding a few instructions to the loop makes the program four times slower. I described a similar case in a previous blog post, where the problem was due to branch mis-prediction, but the branch is perfectly predicted in this example.

The reason here is that the slow loop needs to execute an integer division instruction when calculating scaling, and integer division is expensive — Agner Fog’s instruction tables say that a 64-bit division may need up to 103 cycles on the Broadwell microarchitecture! This usually does not matter too much for normal programs, as the compiler tries to move division instructions so that they have as much time as possible to execute before the result is needed, and the CPU can in general continue executing other instructions out of order while waiting for the result of the division. But it does make a big difference in this kind of micro-benchmark, as the compiler cannot move the division earlier, and the CPU runs out of out-of-order work because the mersenne_twister_engine function executes much faster than the division.