Sunday, March 5, 2017

The cost of conditional moves and branches

The previous blog post contained an example where branching was much more expensive than using a conditional move, but it is easy to find cases where conditional moves reduce performance noticeably. One such case is in this Stack Overflow question (and GCC bug 56309) discussing the performance of a function implementing a naive bignum multiplication:
static void inline
single_mult(const std::vector<ull>::iterator& data,
            const std::vector<ull>::const_iterator& rbegin,
            const std::vector<ull>::const_iterator& rend,
            const ull x)
{
    ull tmp = 0, carry = 0, i = 0;
    for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it)
    {
        tmp = x * (*rhs_it) + data[i] + carry;
        if (tmp >= imax) {
            carry = tmp >> numbits;
            tmp &= imax - 1;
        } else {
            carry = 0;
        }
        data[i++] = tmp;
    }
    data[i] += carry;
}

void
naive(std::vector<ull>::iterator data,
      std::vector<ull>::const_iterator cbegin,
      std::vector<ull>::const_iterator cend,
      std::vector<ull>::const_iterator rbegin,
      std::vector<ull>::const_iterator rend)
{
    for (auto data_it = cbegin; data_it != cend; ++data_it)
    {
        if (*data_it != 0) {
            single_mult(data, rbegin, rend, *data_it);
        }
        ++data;
    }
}

Minor changes to the source code made the compiler use conditional moves instead of a branch, and this reduced the performance by 25%.
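A change of that kind might look like the following sketch (not the exact diff from the bug report; `ull`, `numbits`, and `imax` are given hypothetical definitions here since the quoted code does not show them). Rewriting the if/else as ternary expressions invites the compiler to select conditional moves for the carry computation:

```cpp
#include <cstdint>

using ull = std::uint64_t;
// Hypothetical values for illustration; the original code does not define them.
constexpr int numbits = 32;
constexpr ull imax = ull(1) << numbits;

// One step of the inner loop from single_mult, with the if/else rewritten
// as ternary expressions so the compiler may choose conditional moves.
static ull mult_step(ull x, ull rhs, ull lhs, ull &carry)
{
    ull tmp = x * rhs + lhs + carry;
    carry = (tmp >= imax) ? (tmp >> numbits) : 0;
    tmp = (tmp >= imax) ? (tmp & (imax - 1)) : tmp;
    return tmp;
}
```

Whether this actually produces cmov instructions depends on the compiler version and optimization level; the point is that when it does, the loop gets slower.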

The difference between branches and conditional moves can be illustrated by
a = a + b;
if (c > 0)
    a = -a;
a = a + 1;
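For experimentation, the fragment can be wrapped in a complete function (the name negate_if is made up here); compiling with e.g. g++ -O2 -S and inspecting the output shows which form the compiler chose:

```cpp
// Compile with e.g. "g++ -O2 -S" and look in the assembly for either a
// jle/negl pair (branch) or a cmovg instruction (conditional move).
int negate_if(int a, int b, int c)
{
    a = a + b;
    if (c > 0)
        a = -a;
    a = a + 1;
    return a;
}
```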

It is not possible to calculate the number of clock cycles for a code segment when working with reasonably complex CPUs, but it is often easy to get a good estimate (see e.g. this example for how to use such estimates when optimizing assembly code). The CPU converts the original instructions to micro-ops, and it can dispatch several micro-ops per cycle (e.g. 8 for Broadwell). The details are somewhat complicated [1], but most instructions in this blog post are translated to one micro-op that can be executed without any restrictions.

An assembly version using a branch looks like (assuming that the variables are placed in registers)

        addl    %edx, %eax
        testl   %ecx, %ecx
        jle     .L2
        negl    %eax
.L2:
        addl    $1, %eax

The CPU combines the testl and jle instructions into one micro-op by what is called “macro-fusion”, so both the addition and the test/branch instructions can be dispatched in the first cycle. It takes a while for the compare and branch to execute, but branch prediction means that the CPU can speculatively start executing the next instruction in the following cycle, so the final addl or the negl can be dispatched in the second cycle (depending on whether the branch is predicted as taken or not). The result is that the code segment is done in 2 or 3 cycles, provided that the branch prediction was correct; a mispredict must discard the speculated instructions and restart execution, which typically adds 15–20 cycles.

Generating a version using a conditional move produces something like

        addl    %edx, %eax
        movl    %eax, %edx
        negl    %edx
        testl   %ecx, %ecx
        cmovg   %edx, %eax
        addl    $1, %eax

The first cycle will execute the first addition and the test instruction, but the following cycles can only execute one instruction at a time, as each one depends on the result of the previous instruction. The result is that this needs 5 cycles to execute [2].

So the version with conditional moves takes twice the time to execute compared to the version using a branch, which is noticeable in the kind of short loops found in single_mult. In addition, pipeline restrictions on how instructions can be dispatched (for example, only one division instruction can be dispatched each cycle) make it hard for the CPU to schedule long dependency chains efficiently, which may be a problem for more complex code.

1. See “Intel 64 and IA-32 Architectures Optimization Reference Manual” and Agner Fog’s optimization manuals for the details.
2. This assumes that the cmovg instruction is one micro-op. That is true for some CPUs such as Broadwell, while others split it into two micro-ops.

1. Do you know if any compilers support a "hint" saying a particular branch is likely to be "random" and would likely be mispredicted more than 1/3 of the time? By the sound of it, correctly predicted branches are significantly cheaper than conditional moves, but incorrectly predicted ones are significantly more expensive. If a branch will be taken about half the time, but there is unlikely to be any exploitable pattern to it (e.g. the branching condition is generated by a linear feedback shift register), I would expect the cost of mispredicted branches to outweigh the fixed cost of conditional moves, but I don't think even profile-driven optimization would figure out such a thing without "hints", unless it can somehow determine which branches would have been predicted correctly in the absence of profiling code.

1. Clang has __builtin_unpredictable: http://clang.llvm.org/docs/LanguageExtensions.html#builtin-unpredictable
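A usage sketch (the UNPREDICTABLE wrapper macro is an invention here so the code also compiles on non-Clang compilers; only Clang acts on the hint):

```cpp
#if defined(__clang__)
// Clang-only: tells the optimizer this condition has no exploitable pattern,
// so it should prefer branch-free code such as a conditional move.
#define UNPREDICTABLE(cond) __builtin_unpredictable(cond)
#else
#define UNPREDICTABLE(cond) (cond)  // fallback: plain condition
#endif

int clamp_to_zero(int v)
{
    if (UNPREDICTABLE(v < 0))
        v = 0;
    return v;
}
```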

2. The profile-driven optimizations cannot figure this out, but I guess it could be implemented for the sampling-based feedback-directed optimizations (-fauto-profile) by capturing and processing the “branch-misses” hardware counter...

2. So if the CPU predicts the branch correctly with fraction p, and mispredicts with fraction q (note that p + q == 1.0), the total time will be 3p + 25q. With the cmov, you always need 5. So when is 3p + 25q >= 5?

If the branch prediction is correct more than 91% of the time, the branch is faster.

The long dependency chain might be mitigated by having other work the CPU can do before or after that passage. If the input is unpredictable (p == q == 0.5), the expected branch cost is 3·0.5 + 25·0.5 = 14 cycles, nearly 3× slower than the cmov. Even with other work to fill the unused dispatch slots, as in most normal code, less-than-stellar branch prediction is bad news. Many times, the special case is the loop exit, so the predictor guesses “keep looping” and is wrong once. For loops that iterate more than about 10 times, branch prediction will be good. For small loops, say 5 items, you can still do OK if the loop variable is on its own, e.g. the i counter. That’s because the CPU will effectively unroll the loop, and the ++i and its exit test can be resolved many iterations ahead of the oldest not-yet-committed instruction.

So, black magic.