Wednesday, February 22, 2017

Branch misprediction is expensive: an example

The SciMark 2.0 Monte Carlo benchmark is calculating $$\pi$$ by generating random points $$\{(x,y) \mid x,y \in [0,1]\}$$ and calculating the ratio of points that are located within the quarter circle $$\sqrt{x^2 + y^2} \le 1$$. The square root can be avoided by squaring both sides, and the benchmark is implemented as
double MonteCarlo_integrate(int Num_samples)
{
Random R = new_Random_seed(SEED);
int under_curve = 0;

for (int count = 0; count < Num_samples; count++)
{
double x = Random_nextDouble(R);
double y = Random_nextDouble(R);
if (x*x + y*y <= 1.0)
under_curve++;
}

Random_delete(R);
return ((double) under_curve / Num_samples) * 4.0;
}
GCC used to generate a conditional move for this if-statement, but a recent change made this generate a normal branch which caused a 30% performance reduction for the benchmark due to the branch being mispredicted (bug 79389).

The randomization function is not inlined as it is compiled in a separate file, and it contains a non-trivial amount of loads, stores, and branches
typedef struct
{
int m;
int seed, i, j, haveRange;
double left, right, width;
} Random_struct, *Random;

#define MDIG 32
#define ONE 1
static const int m1 = (ONE << (MDIG-2)) + ((ONE << (MDIG-2)) - ONE);
static const int m2 = ONE << MDIG/2;
static double dm1;

double Random_nextDouble(Random R)
{
int I = R->i;
int J = R->j;
int *m = R->m;

int k = m[I] - m[J];
if (k < 0)
k += m1;
R->m[J] = k;

if (I == 0)
I = 16;
else
I--;
R->i = I;

if (J == 0)
J = 16;
else
J--;
R->j = J;

if (R->haveRange)
return  R->left +  dm1 * (double) k * R->width;
else
return dm1 * (double) k;
}
so I had expected the two calls to this function to dominate the running time, and that the cost of the branch would not affect the benchmark too much. But I should have known better — x86 CPUs can have more than 100 instructions in flight (192 micro-ops for Broadwell), and a mispredict need to throw away all that work and restart from the actual branch target.

The cost of branch instructions differ between different CPU implementations, and the compiler needs to take that into account when optimizing and generating branches.

Simple processors with a 3-stage pipeline fetch the next instruction when previous two instructions are decoded and executed, but branches introduce a problem: the next instruction cannot be fetched before the address is calculated by executing the branch instruction. This makes branches expensive as they introduce bubbles in the pipeline. The cost can be reduced for conditional branches by speculatively fetching and decoding the instructions after the branch — this improves performance if the branch was not taken, but taken branches need to discard the speculated work, and restart from the actual branch target.

Some CPUs have instructions that can execute conditionally depending on a condition, and this can be used to avoid branches. For example
if (c)
a += 3;
else
b -= 2;
can be compiled to the following straight line code on ARM (assuming that a, b, and c are placed in r1, r2, and r0 respectively)
cmp     r0, #0
subeq   r2, r2, #2
The cmp instruction sets the Z flag in the status register, and the addne instruction is treated as an addition if Z is 0, and as a nop instruction if Z is 1. subeq is similarly treated as a subtraction if Z is 1 and as a nop if Z is 0. The instruction takes time to execute, even when treated as a nop, but this is still much faster than executing branches.

This means that the compiler should structure the generated code so that branches are minimized (using conditional execution when possible), and conditional branches should be generated so that the most common case is not taking the branch.

Taken branches become more expensive as the CPUs get deeper pipelines, and this is especially annoying as loops must branch to the top of the loop for each iteration. This can be solved by adding more hardware to let the fetch unit calculate the target address of the conditional branch, and the taken branch can now be the cheap case.

It is, however, nice to have the “not taken” case be the cheap case, as the alternative often introduce contrived control flow that fragments the instruction cache and need to insert extra “useless” (and expensive) unconditional branches. The way most CPUs solve this is to predict that forward branches are unlikely (and thus speculatively fetch from following instructions), and that backward branches are likely (and thus speculatively fetch from the branch target).

The compiler should do similar work as for the simpler CPU, but structure the code so that conditional branches branching forward are not taken in the common case, and conditional branches branching backward are taken in the common case.

There are many branches that the compiler cannot predict, so the next step up in complexity is adding branch prediction to the CPU. The basic idea is that the CPU keeps a cache of previous branch decisions and use this to predict the current branch. High-end branch predictors look at the history of code flow, and can correctly predict repetitive patterns in how the branch behaved. Hardware vendors do not publish detailed information about how the prediction work, but Agner Fog’s optimization manuals contain lots of information (especially part 3, “The microarchitecture of Intel, AMD and VIA CPUs”, that also have a good overview of different ways branch prediction can be done).

Branch prediction in high-end CPUs is really good, so branches are essentially free, while conditional execution adds extra dependencies between instructions which constrain the out-of-order execution engine, so conditional execution should be avoided. This is essentially the opposite from how the simple CPUs should be handled. 😃

There is one exception — branches that cannot be predicted (such as the one in SciMark) should be generated using conditional instructions, as the branch will incur the misprediction cost of restarting the pipeline each time it is mispredicted.

The compiler should not use conditional execution unless the condition is unpredictable. The code should be structured as for the static prediction (this is not strictly necessary, but most CPUs use the static prediction first time a branch is encountered. And it is also slightly more efficient for the instruction cache).

So branches are free, except when they cannot be predicted. I find it amusing that many algorithms (balanced search trees, etc.) have the aim to make the branches as random as possible. I do not know how much this is a problem in reality, but Clang has a built-in function, __builtin_unpredictable, that can be used to tell the compiler that the condition is unpredictable.

Heuristics for estimating branch probabilities

The compiler estimates branch probabilities in order to generate the efficient form of the branch (there are more optimizations that need to know if a code segment is likely executed or not, such as inlining and loop unrolling). The general idea, as described in the PLDI ’93 paper “Branch Prediction for Free”, is to look at how the branches are used. For example, code such as
if (p == NULL)
return -1;
comparing a pointer with NULL and returning a constant, is most likely error handling, and thus unlikely to execute the return statement.

GCC has a number of such predictors, for example
• Branch ending with returning a constant is probably not taken.
• Branch from comparison using != is probably taken, == is probably not taken.
• Branch to a basic block calling a cold function is probably not taken.
Each predictor provides a probability (that has been set from branch frequencies observed by instrumenting real world code), and these probabilities are combined for a final result. This is one reason why __builtin_expect often does not make any difference — the heuristics are already coming to the same conclusion!

The predictors are defined in predict.def (some of the definitions seem reversed due to how the rules are implemented, e.g. PROB_VERY_LIKELY may mean “very unlikely”, but the comments describing each heuristic are correct). You can see how GCC is estimating the branch probabilities by passing -fdump-tree-profile_estimate to the compiler, which writes a file containing the output from the predictors for each basic block
Predictions for bb 2
DS theory heuristics: 1.7%
combined heuristics: 1.7%
pointer (on trees) heuristics of edge 2->4: 30.0%
call heuristics of edge 2->3: 33.0%
negative return heuristics of edge 2->4: 2.0%
as well as (when using GCC 7.x) the IR annotated with the estimated probabilities.

1. 1. 