Sunday, March 5, 2017

The cost of conditional moves and branches

The previous blog post contained an example where branching was much more expensive than using a conditional move, but it is easy to find cases where conditional moves reduce performance noticeably. One such case is discussed in this Stack Overflow question (and GCC bug 56309) about the performance of a function implementing a naive bignum multiplication
static void inline
single_mult(const std::vector<ull>::iterator& data,
            const std::vector<ull>::const_iterator& rbegin,
            const std::vector<ull>::const_iterator& rend,
            const ull x)
{
    ull tmp=0, carry=0, i=0;
    for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it)
    {
        tmp = x * (*rhs_it) + data[i] + carry;
        if (tmp >= imax) {
            carry = tmp >> numbits;
            tmp &= imax - 1;
        } else { 
            carry = 0;
        }
        data[i++] = tmp;
    }
    data[i] += carry;
}

void
naive(std::vector<ull>::iterator data, 
      std::vector<ull>::const_iterator cbegin,
      std::vector<ull>::const_iterator cend,
      std::vector<ull>::const_iterator rbegin,
      std::vector<ull>::const_iterator rend)
{
    for (auto data_it = cbegin; data_it != cend; ++data_it)
    {
        if (*data_it != 0) {
            single_mult(data, rbegin, rend, *data_it);
        }
        ++data;
    }
}
Minor changes to the source code made the compiler use conditional moves instead of a branch, and this reduced the performance by 25%.
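As a sketch of the kind of rewrite involved (not necessarily the exact change from the bug report), expressing the carry handling with ternary operators instead of an if-statement is the sort of change that tends to make GCC emit conditional moves. The definitions of ull, numbits, and imax below are assumptions added only to make the example self-contained:
#include <vector>

typedef unsigned long long ull;
static const int numbits = 32;            // assumed value
static const ull imax = 1ULL << numbits;  // assumed to be a power of two

static void inline
single_mult_branchless(const std::vector<ull>::iterator& data,
                       const std::vector<ull>::const_iterator& rbegin,
                       const std::vector<ull>::const_iterator& rend,
                       const ull x)
{
    ull tmp = 0, carry = 0, i = 0;
    for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it)
    {
        tmp = x * (*rhs_it) + data[i] + carry;
        carry = tmp >= imax ? tmp >> numbits : 0;  // candidate for a cmov
        tmp &= imax - 1;   // no-op when tmp < imax, since imax is a power of two
        data[i++] = tmp;
    }
    data[i] += carry;
}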

The difference between branches and conditional moves can be illustrated by
a = a + b;
if (c > 0)
    a = -a;
a = a + 1;
It is not possible to calculate the exact number of clock cycles for a code segment when working with reasonably complex CPUs, but it is often easy to get a good estimate (see e.g. this example for how to use such estimates when optimizing assembly code). The CPU converts the original instructions to micro-ops, and it can dispatch several micro-ops per cycle (e.g. 8 for Broadwell). The details are somewhat complicated,1 but most instructions in this blog post are translated to one micro-op that can be executed without any restrictions.

An assembly version using a branch looks like (assuming that the variables are placed in registers)
    addl    %edx, %eax
    testl   %ecx, %ecx
    jle     .L2
    negl    %eax
.L2:
    addl    $1, %eax
The CPU combines the testl and jle instructions into one micro-op by what is called “macro-fusion”, so both the addition and the test/branch can be dispatched in the first cycle. It takes a while for the compare and branch to execute, but branch prediction means that the CPU can speculatively start executing the next instruction in the following cycle, so the final addl or the negl can be dispatched in the second cycle (depending on whether the branch is predicted as taken or not). The result is that the code segment is done in 2 or 3 cycles, provided that the branch prediction was correct — a mispredict must discard the speculated instructions and restart execution, which typically adds 15–20 cycles.

Generating a version using a conditional move produces something like
    addl    %edx, %eax
    movl    %eax, %edx
    negl    %edx
    testl   %ecx, %ecx
    cmovg   %edx, %eax
    addl    $1, %eax
The first cycle will execute the first addition and the test instruction, and the following cycles will only be able to execute one instruction at a time as all of them depend on the previous instruction. The result is that this needs 5 cycles to execute.2

So the version with conditional moves takes twice the time to execute compared to the version using a branch, which is noticeable in the kind of short loops from single_mult. In addition, pipeline restrictions on how instructions can be dispatched (for example, only one division instruction can be dispatched each cycle) make it hard for the CPU to schedule long dependency chains efficiently, which may be a problem for more complex code.


1. See “Intel 64 and IA-32 Architectures Optimization Reference Manual” and Agner Fog’s optimization manuals for the details.
2. This assumes that the cmovg instruction is one micro-op. That is true for some CPUs such as Broadwell, while others split it into two micro-ops.

Wednesday, February 22, 2017

Branch prediction

Branch misprediction is expensive: an example

The SciMark 2.0 Monte Carlo benchmark calculates \(\pi\) by generating random points \(\{(x,y) \mid x,y \in [0,1]\}\) and computing the ratio of points that are located within the quarter circle \(\sqrt{x^2 + y^2} \le 1\). The square root can be avoided by squaring both sides, and the benchmark is implemented as
double MonteCarlo_integrate(int Num_samples)
{
    Random R = new_Random_seed(SEED);
    int under_curve = 0;

    for (int count = 0; count < Num_samples; count++)
    {
        double x = Random_nextDouble(R);
        double y = Random_nextDouble(R);
        if (x*x + y*y <= 1.0)
            under_curve++;
    }

    Random_delete(R);
    return ((double) under_curve / Num_samples) * 4.0;
}
GCC used to generate a conditional move for this if-statement, but a recent change made this generate a normal branch which caused a 30% performance reduction for the benchmark due to the branch being mispredicted (bug 79389).
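One source-level way to sidestep the unpredictable branch is to add the result of the comparison directly, which compilers typically turn into branch-free code. A sketch of the loop body only (whether a conditional move or a vectorized compare is actually emitted depends on the compiler version and target):
for (int count = 0; count < Num_samples; count++)
{
    double x = Random_nextDouble(R);
    double y = Random_nextDouble(R);
    under_curve += (x*x + y*y <= 1.0);  /* the comparison evaluates to 0 or 1 */
}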

The randomization function is not inlined as it is compiled in a separate file, and it contains a non-trivial number of loads, stores, and branches
typedef struct
{
    int m[17];
    int seed, i, j, haveRange;
    double left, right, width;
} Random_struct, *Random;

#define MDIG 32
#define ONE 1
static const int m1 = (ONE << (MDIG-2)) + ((ONE << (MDIG-2)) - ONE);
static const int m2 = ONE << MDIG/2;
static double dm1;

double Random_nextDouble(Random R) 
{
    int I = R->i;
    int J = R->j;
    int *m = R->m;

    int k = m[I] - m[J];
    if (k < 0)
        k += m1;
    R->m[J] = k;

    if (I == 0) 
        I = 16;
    else
        I--;
    R->i = I;

    if (J == 0) 
        J = 16;
    else
        J--;
    R->j = J;

    if (R->haveRange) 
        return  R->left +  dm1 * (double) k * R->width;
    else
        return dm1 * (double) k;
}
so I had expected the two calls to this function to dominate the running time, and that the cost of the branch would not affect the benchmark too much. But I should have known better — x86 CPUs can have more than 100 instructions in flight (192 micro-ops for Broadwell), and a mispredict needs to throw away all that work and restart from the actual branch target.


Branch overhead and branch prediction

The cost of branch instructions differs between CPU implementations, and the compiler needs to take that into account when optimizing and generating branches.

Simple processors with a 3-stage pipeline fetch the next instruction while the previous two instructions are being decoded and executed, but branches introduce a problem: the next instruction cannot be fetched before the address is calculated by executing the branch instruction. This makes branches expensive, as they introduce bubbles in the pipeline. The cost can be reduced for conditional branches by speculatively fetching and decoding the instructions after the branch — this improves performance if the branch is not taken, but taken branches need to discard the speculated work and restart from the actual branch target.

Some CPUs have instructions that can execute conditionally depending on a condition, and this can be used to avoid branches. For example
if (c)
    a += 3;
else
    b -= 2;
can be compiled to the following straight line code on ARM (assuming that a, b, and c are placed in r1, r2, and r0 respectively)
cmp     r0, #0
addne   r1, r1, #3
subeq   r2, r2, #2
The cmp instruction sets the Z flag in the status register, and the addne instruction is treated as an addition if Z is 0, and as a nop instruction if Z is 1. subeq is similarly treated as a subtraction if Z is 1 and as a nop if Z is 0. The instruction takes time to execute, even when treated as a nop, but this is still much faster than executing branches.

This means that the compiler should structure the generated code so that branches are minimized (using conditional execution when possible), and conditional branches should be generated so that the most common case is not taking the branch.

Taken branches become more expensive as the CPUs get deeper pipelines, and this is especially annoying as loops must branch to the top of the loop for each iteration. This can be solved by adding more hardware to let the fetch unit calculate the target address of the conditional branch, and the taken branch can now be the cheap case.

It is, however, nice to have the “not taken” case be the cheap case, as the alternative often introduces contrived control flow that fragments the instruction cache and needs extra “useless” (and expensive) unconditional branches. The way most CPUs solve this is to predict that forward branches are unlikely (and thus speculatively fetch the following instructions), and that backward branches are likely (and thus speculatively fetch from the branch target).

The compiler should do similar work as for the simpler CPU, but structure the code so that conditional branches branching forward are not taken in the common case, and conditional branches branching backward are taken in the common case.

There are many branches that the compiler cannot predict, so the next step up in complexity is adding branch prediction to the CPU. The basic idea is that the CPU keeps a cache of previous branch decisions and use this to predict the current branch. High-end branch predictors look at the history of code flow, and can correctly predict repetitive patterns in how the branch behaved. Hardware vendors do not publish detailed information about how the prediction work, but Agner Fog’s optimization manuals contain lots of information (especially part 3, “The microarchitecture of Intel, AMD and VIA CPUs”, that also have a good overview of different ways branch prediction can be done).

Branch prediction in high-end CPUs is really good, so branches are essentially free, while conditional execution adds extra dependencies between instructions that constrain the out-of-order execution engine, so conditional execution should be avoided. This is essentially the opposite of how the simple CPUs should be handled. 😃

There is one exception — branches that cannot be predicted (such as the one in SciMark) should be generated using conditional instructions, as the branch will incur the misprediction cost of restarting the pipeline each time it is mispredicted.

The compiler should not use conditional execution unless the condition is unpredictable. The code should still be structured as for the static prediction (this is not strictly necessary, but most CPUs fall back to the static prediction the first time a branch is encountered, and it is also slightly more efficient for the instruction cache).

So branches are free, except when they cannot be predicted. I find it amusing that many algorithms (balanced search trees, etc.) aim to make the branches as random as possible. I do not know how much of a problem this is in reality, but Clang has a built-in function, __builtin_unpredictable, that can be used to tell the compiler that a condition is unpredictable.
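A sketch of how it can be used, with the SciMark condition as an example (whether this actually results in branch-free code depends on the compiler version and target):
if (__builtin_unpredictable(x*x + y*y <= 1.0))
    under_curve++;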


Heuristics for estimating branch probabilities

The compiler estimates branch probabilities in order to generate the efficient form of the branch (there are more optimizations that need to know if a code segment is likely executed or not, such as inlining and loop unrolling). The general idea, as described in the PLDI ’93 paper “Branch Prediction for Free”, is to look at how the branches are used. For example, code such as
if (p == NULL)
    return -1;
comparing a pointer with NULL and returning a constant, is most likely error handling, and thus unlikely to execute the return statement.

GCC has a number of such predictors, for example
  • Branch ending with returning a constant is probably not taken.
  • Branch from comparison using != is probably taken, == is probably not taken.
  • Branch to a basic block calling a cold function is probably not taken.
Each predictor provides a probability (that has been set from branch frequencies observed by instrumenting real world code), and these probabilities are combined for a final result. This is one reason why __builtin_expect often does not make any difference — the heuristics are already coming to the same conclusion!
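For reference, a minimal sketch of the annotation, using the common unlikely wrapper macro:
#define unlikely(x) __builtin_expect(!!(x), 0)

if (unlikely(p == NULL))
    return -1;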

The predictors are defined in predict.def (some of the definitions seem reversed due to how the rules are implemented, e.g. PROB_VERY_LIKELY may mean “very unlikely”, but the comments describing each heuristic are correct). You can see how GCC is estimating the branch probabilities by passing -fdump-tree-profile_estimate to the compiler, which writes a file containing the output from the predictors for each basic block
Predictions for bb 2
  DS theory heuristics: 1.7%
  combined heuristics: 1.7%
  pointer (on trees) heuristics of edge 2->4: 30.0%
  call heuristics of edge 2->3: 33.0%
  negative return heuristics of edge 2->4: 2.0%
as well as (when using GCC 7.x) the IR annotated with the estimated probabilities.

Sunday, January 8, 2017

GCC code generation for C++ Weekly Ep 43 example

Episode 43 of “C++ Weekly” talks about evaluating and eliminating code at compile time, and the example is fun as it triggers a few different deficiencies in the GCC optimization passes (using the -O3 optimization level with GCC trunk r243987 built for x86_64-linux-gnu).

The example
#include <type_traits>
#include <numeric>
#include <iterator>

template<typename ... T>
int sum(T ... t)
{
  std::common_type_t<T...> array[sizeof...(T)]{ t... };

  return std::accumulate(std::begin(array), std::end(array), 0);
}

int main()
{
  return sum(5,4,3,2,1);
}
is meant to be optimized to return a constant value, and it does
main:
    movl    $15, %eax
    ret
But the behavior varies a lot depending on how many arguments are passed to sum: 1–5 arguments work as expected, while calling sum with 6 or 7 arguments is generated as
main:
 movdqa .LC0(%rip), %xmm0
 movaps %xmm0, -40(%rsp)
 movl -32(%rsp), %edx
 movl -28(%rsp), %eax
 leal 14(%rdx,%rax), %eax
 ret
where .LC0 is an array consisting of four constants. 9–12 arguments are similar (but with more code). 

13–28 arguments are generated as a constant again
main:
    movl    $91, %eax
    ret

29–64 arguments are optimized to a constant, but with some redundant stack adjustments when the number of arguments is not divisible by four
main:
    subq    $16, %rsp
    movl    $435, %eax
    addq    $16, %rsp
    ret

Finally, 65 or more arguments are generated as a vectorized monstrosity in a separate function, called from main by pushing all the arguments to the stack, one at a time
main:
    subq    $16, %rsp
    movl    $60, %r9d
    movl    $61, %r8d
    pushq   $1
    pushq   $2
    movl    $62, %ecx
    pushq   $3
    pushq   $4
    ...
This is essentially as far from generating a constant as you can come. 😀

The rest of the blog post will look at how GCC reasons when trying to optimize this, by examining GCC's internal representation of the program at different points in the optimization pipeline. The IR works essentially as a restricted version of the C language, and you can get GCC to write the IR to a file after each pass by using the command-line option -fdump-tree-all.
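For example (the file name is just a placeholder):
g++ -O3 -fdump-tree-all sum.cpp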

1–5 arguments

The use of std::accumulate and iterators expands to five functions, and the compiler starts by inlining and simplifying this to
int main() ()
{
  common_type_t array[5];
  int __init;
  int * __first;
  int _3;

  <bb 2> [16.67%]:
  array[0] = 5;
  array[1] = 4;
  array[2] = 3;
  array[3] = 2;
  array[4] = 1;

  <bb 3> [100.00%]:
  # __first_2 = PHI <&array(2), __first_6(4)>
  # __init_4 = PHI <0 __init_5>
  if (&MEM[(void *)&array + 20B] == __first_2)
    goto <bb 5>; [16.67%]
  else
    goto <bb 4>; [83.33%]

  <bb 4> [83.33%]:
  _3 = *__first_2;
  __init_5 = _3 + __init_4;
  __first_6 = __first_2 + 4;
  goto <bb 3>; [100.00%]

  <bb 5> [16.67%]:
  array ={v} {CLOBBER};
  return __init_4;
}
The loop is immediately unrolled and simplified to
int main() ()
{
  common_type_t array[5];
  int __init;
  int * __first;
  int _15;
  int _20;
  int _25;
  int _30;
  int _35;

  <bb 2> [16.70%]:
  array[0] = 5;
  array[1] = 4;
  array[2] = 3;
  array[3] = 2;
  array[4] = 1;
  _15 = MEM[(int *)&array];
  __init_16 = _15;
  _20 = MEM[(int *)&array + 4B];
  __init_21 = _15 + _20;
  _25 = MEM[(int *)&array + 8B];
  __init_26 = __init_21 + _25;
  _30 = MEM[(int *)&array + 12B];
  __init_31 = __init_26 + _30;
  _35 = MEM[(int *)&array + 16B];
  __init_36 = __init_31 + _35;
  array ={v} {CLOBBER};
  return __init_36;
}
that is then optimized to a constant by the fre3 (“Full Redundancy Elimination”) pass
int main() ()
{
  <bb 2> [16.70%]:
  return 15;
}

6–12 arguments

The early optimizations that handled the previous case are there mostly to get rid of noise before the heavier optimizations (such as the loop optimizations) kick in, and a loop doing 6 iterations is considered too big to be unrolled by the early unroller.

The “real” loop optimizers determine that the loop does not iterate enough for vectorization to be profitable, so it is just unrolled. The “SLP vectorizer”, which vectorizes straight-line code, is run right after the loop optimizations, and it sees that we are copying constants into consecutive addresses, so it combines four of them into a vector assignment
MEM[(int *)&array] = { 6, 5, 4, 3 };
array[4] = 2;
array[5] = 1;
This is now simplified by the dom3 pass, which does SSA dominator optimizations (jump threading, redundancy elimination, and const/copy propagation), but it does not understand that a scalar load from memory initialized by a constant vector store yields a constant, so it only propagates the constants for array[4] and array[5], which were initialized as scalars, and the code passed to the backend looks like
int main() ()
{
  common_type_t array[6];
  int __init;
  int _15;
  int _22;
  int _27;
  int _32;

  <bb 2> [14.31%]:
  MEM[(int *)&array] = { 6, 5, 4, 3 };
  _15 = MEM[(int *)&array];
  _22 = MEM[(int *)&array + 4B];
  __init_23 = _15 + _22;
  _27 = MEM[(int *)&array + 8B];
  __init_28 = __init_23 + _27;
  _32 = MEM[(int *)&array + 12B];
  __init_33 = __init_28 + _32;
  __init_43 = __init_33 + 3;
  array ={v} {CLOBBER};
  return __init_43;
}

13–28 arguments

The loop is now iterated enough times that the compiler determines that vectorization is profitable. The idea behind the vectorization is to end up with something like
tmp = { array[0], array[1], array[2], array[3] }
    + { array[4], array[5], array[6], array[7] }
    + { array[8], array[9], array[10], array[11] };
sum = tmp[0] + tmp[1] + tmp[2] + tmp[3] + array[12];
and the vectorizer generates two loops — one that consumes four elements at a time as long as possible, and one that consumes the remaining elements one at a time. The rest of the loop optimizers know how many times the loops are iterating, so the loops can then be unrolled etc. as appropriate.
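The structure of the two loops is roughly as follows (a sketch in plain C rather than GCC's actual output, assuming an int array named array with n elements):
int i = 0;
int partial[4] = { 0, 0, 0, 0 };
for (; i + 4 <= n; i += 4) {       /* vector loop: four elements per iteration */
    partial[0] += array[i];
    partial[1] += array[i + 1];
    partial[2] += array[i + 2];
    partial[3] += array[i + 3];
}
int sum = partial[0] + partial[1] + partial[2] + partial[3];
for (; i < n; i++)                 /* scalar epilogue for the remainder */
    sum += array[i];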

The vectorizer is, unfortunately, generating somewhat strange code for the checks that there are enough elements
_35 = (unsigned long) &MEM[(void *)&array + 52B];
_36 = &array + 4;
_37 = (unsigned long) _36;
_38 = _35 - _37;
_39 = _38 /[ex] 4;
_40 = _39 & 4611686018427387903;
if (_40 <= 4)
  goto ; [10.00%]
else
  goto ; [90.00%]
that confuses the rest of the loop optimizations, with the result that the IR contains lots of conditional code of this form. This is not the first time I have seen GCC having problems with the pointer arithmetic from iterators (see bug 78847), and I believe this is the same problem (as the bitwise AND should be optimized away when the pointer arithmetic has been evaluated to a constant).

The subsequent passes mostly manage to clean up these conditionals, and dom3 optimizes the vector operations to a constant. But it does not understand the expression used to decide how many scalar elements need to be handled if the iteration count is not a multiple of four (that check is eliminated by the value range propagation pass after dom3 is run), so the scalar additions are kept in the code given to the backend
int main() ()
{
  common_type_t array[13];
  int __init;
  int _23;

  <bb 2> [7.14%]:
  array[12] = 1;
  _23 = MEM[(int *)&array + 48B];
  __init_3 = _23 + 90;
  array ={v} {CLOBBER};
  return __init_3;
}
This is, however, not much of a problem for this program, as the backend manages to optimize this to
main:
    movl    $91, %eax
    ret
when generating the code.

29–64 arguments

The backend eliminates the memory accesses to the array in the previous case, but the array seems to be kept on the stack. That makes sense — the code is supposed to be optimized before being passed to the backend, so the backend should not need to eliminate variables, and there is no need to implement code doing this.

Leaf functions do not need to adjust the stack, but GCC does some stack adjustment on leaf functions too when more than 112 bytes are placed on the stack. You can see this for the meaningless function
void foo()
{
    volatile char a[113];
    a[0] = 0;
}
where the stack is adjusted when the array size is larger than 112.
foo:
    subq    $16, %rsp
    movb    $0, -120(%rsp)
    addq    $16, %rsp
    ret
I do not understand what GCC is trying to do here...

Anyway, passing 29 arguments to sum makes the array large enough that GCC adds the stack adjustments.

65– arguments

The sequence of assignments initializing the array is now large enough that sum is not inlined into main.

Thursday, December 15, 2016

Pointer comparison — an invalid optimization in GCC

I wrote in an earlier blog post that GCC has a somewhat aggressive interpretation of the standard so that it compiles p == q to false if p and q are pointers derived from different objects. I'm now convinced GCC is wrong.

The problem

C does not permit pointers to point at arbitrary memory — a pointer must point to an address within an object or to the address immediately following the object, and pointer arithmetic cannot make it point outside that range. When comparing pointers, the C standard says (emphasis added)
Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.
That is, the comparison in the printf call
#include <stdio.h>

int main(void)
{
    int x[4], y[4];
    int *p = &x[4];
    int *q = &y[0];
    printf("%p %p %d\n", (void*)p, (void*)q, p == q);
    return 0;
}
evaluates to true if the compiler happens to place y directly after x in memory. GCC does, however, substitute the address comparison with false if the pointers are derived from different objects, and the program above may give output such as
0x7fff66339290 0x7fff66339290 0
where the comparison evaluates to false even though both pointers are the same. The argument in bug 61502 is essentially that the standard is unclear about what "immediately follow" means, and that applications cannot rely on memory layout anyway (so there is no real use case). I believe both claims are wrong.

Use case

One use case for this kind of comparison is in firmware for small embedded devices — you have a framework that supports different hardware configurations, but you want to avoid adding an #ifdef for every optional functionality. Instead, each implemented functionality exports a structure containing configuration data, and the linker collects them into an address range and generates a symbol __start for the start address, and __end for the address immediately following the array. The good thing about this is that the array contains exactly what is present at link time — elements are added to the array by just adding files to the link command, and functionality can be included/excluded by just relinking, without any need for additional configuration.

C does not have a convenient way of representing the __end label, so it is typically expressed as
extern struct my_struct __start[];
extern struct my_struct __end[];
where the linker guarantees that __end is immediately following __start. The array is processed as
void foo(void)
{
    for (struct my_struct *x = __start; x != __end; x++)
        do_something(x);
}
which is compiled to an infinite loop if GCC optimizes comparisons of pointers derived from two objects as always being different. The workaround suggested in the GCC mailing lists is to insert some inline assembly to confuse the compiler
void foo(void)
{
    struct my_struct *x = __start;
    struct my_struct *y = __end;
    asm("":"+r"(x));
    asm("":"+r"(y));
    for (; x != y; x++)
        do_something(x);
}
but this should not be needed for the natural reading of the standard excerpt quoted above.

This particular use case seems to work now as a side effect of the "missing" optimization of global constant addresses, so the nasty workaround is not needed. But some GCC developers still claim the code is invalid...

The meaning of "following"

Comment #3 of the bug report argues that GCC is correct:
Except within a larger object, I'm not aware of any reason the cases of two objects following or not following each other in memory must be mutually exclusive. (If the implementation can track the origins of bit-patterns and where copies of those bit-patterns have got to, it might have a compacting garbage collector that relocates objects and changes what's adjacent to what, for example — I think such implementations are within the scope of what the C standard is intended to support. Or if you're concerned about how this changes bit-patterns of pointers, imagine that a C pointer is a (object key, offset) pair, and that comparison first converts the C pointer into a hardware address, where it's the mapping from object keys to hardware addresses that changes as a result of garbage collection rather than anything about the representation of the pointer.)
I do not think this is a correct interpretation. For example, DR #077 asks the question "Is the address of an object constant throughout its lifetime?", and the committee answers (emphasis added)
The C Standard does not explicitly state that the address of an object remains constant throughout the life of the object. That this is the intent of the Committee can be inferred from the fact that an address constant is considered to be a constant expression. The framers of the C Standard considered that it was so obvious that addresses should remain constant that they neglected to craft words to that effect.
That is, compacting garbage collection is not within the scope of what the C standard supports. I'm even less convinced by the key/offset example — this is the same situation as "normal" hardware, as you can view the virtual address as a pair (page, offset), and it is still the case that x[16] is immediately following x[15] for an array
uint32_t x[32];
even if x crosses a page boundary so that x[15] and x[16] are placed on different pages.

Friday, December 9, 2016

GCC __attribute__((pure)) and C++ exceptions

John Regehr tweeted the example
int foo(void) __attribute__((pure));

int bar(void) {
  int acc = 0;
  for (int i = 0; i < 100; ++i)
    acc += foo();
  return acc;
}
that is optimized to
_Z3barv:
    pushq   %rax
    callq   _Z3foov
    imull   $100, %eax, %eax
    popq    %rcx
    retq
by clang, but is not optimized by GCC. The obvious question is "why is it not optimized?"

GCC does actually optimize this if it is compiled as C, but not if compiled as C++. The reason is that GCC allows pure functions to throw exceptions in C++, which prevents some optimizations — bar is optimized as expected in C++ too if foo is marked as noexcept, or if exceptions are disabled by passing the command-line option -fno-exceptions to the compiler.
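A sketch of the declaration that enables the optimization (bar from the example above is unchanged):
// The pure call can no longer throw, so the loop in bar can be
// collapsed to a single call multiplied by 100.
__attribute__((pure)) int foo() noexcept;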

There are additional restrictions on how pure functions may be optimized. The description of the pure attribute in the GCC manual says
Many functions have no effects except the return value and their return value depends only on the parameters and/or global variables. Such a function can be subject to common subexpression elimination and loop optimization just as an arithmetic operator would be. These functions should be declared with the attribute pure.
The possible use of global variables limits how calls to the pure functions may be moved. For example,
int foo(void) __attribute__((pure));
int a;

int baz(void)
{
    int acc = 0;
    acc += foo();
    a = 0;
    acc += foo();
    return acc;
}
cannot be optimized to the equivalent of
int baz(void)
{
    a = 0;
    return 2 * foo();
}
as it is possible that the result of foo() depends on the global variable a.

There is another attribute, const, for functions without side effects that do not depend on global variables
Many functions do not examine any values except their arguments, and have no effects except the return value. Basically this is just slightly more strict class than the pure attribute below, since function is not allowed to read global memory.
Decorating foo with the const attribute would optimize the function baz above, but the compiler is still restricted in how it can move the calls, even if the functions are known not to throw exceptions and are marked as const — they may take an arbitrarily long time to execute, so the compiler should not move a call so that it is executed on a path where it was not executed in the original code.
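For reference, the declaration looks like this (a minimal sketch; baz from above is otherwise unchanged):
int foo(void) __attribute__((const));  /* no side effects, no reads of global memory */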

This means that the optimization opportunities for pure and const functions are rather limited, and the possibility of throwing exceptions does not affect them that much — a pure or const function that throws an exception will always do so for the same input, so calls can still be optimized by subexpression elimination, memoization, etc. The only extra restriction is that the first call of a function must not be moved over some other construct that can throw an exception.

So GCC is overly conservative when handling the loop in bar — the optimization is valid even if foo may throw — but most more complex cases cannot be handled. For example,
int foo() __attribute__((pure));
int foo2(int) __attribute__((pure));

int bar2() 
{
    int acc = 0;
    int acc2 = 0;
    for (int i = 0; i < 100; i++) {
        acc2 += foo2(i);
        acc += foo();
    }
    return acc + acc2;
}
cannot be optimized to the equivalent of
int bar2() 
{
    int acc2 = 0;
    for (int i = 0; i < 100; i++) {
        acc2 += foo2(i);
    }
    return 100 * foo() + acc2;
}
as this will give a different result if both foo() and foo2(1) throw (while foo2(0) does not throw). There are ways to handle some of these more complex cases (e.g. peeling the first iteration of the loop to get the first calls in the correct order, and then optimizing the rest of the loop), but this is not common enough that it is worth complicating the optimizers by special-casing the few cases that can be optimized.

Wednesday, November 30, 2016

Loop optimization, and microbenchmarking

I read a blog post, "Reducing the Performance Impact of Debug Code in C Libraries", that describes a way to handle the overhead of debug code (compile two versions with and without debug code enabled, and load the right version dynamically). The blog post's approach is a reasonable way to solve the problem, but the example illustrating the overhead of debug code surprised me:
#include <stdio.h>
#include <stdlib.h>

#define N 100000000 // Adjust for your machine

int main(void)
{
    int debug_on = !!getenv("DEBUG");
    unsigned long sum = 0;

    for (int i = 0 ; i < N ; i++) {
        for (int j = 0 ; j < N ; j++) {
#ifdef WITH_DEBUG
            if (debug_on)
                printf("Debug print.\n");
#endif
            sum += i;
        }
    }
    printf("Sum: %lu\n", sum);

    return 0;
}
The blog post claims that the version compiled with WITH_DEBUG defined (introducing the overhead of "if (false)") is 50% slower than the one compiled without it — I had thought that both versions would have the same running time, so I tested the program compiled with "gcc -O3", and the difference was even bigger on my computer...

The reason I thought both versions would have identical performance is that GCC has an optimization -funswitch-loops (enabled at -O3) that optimizes loops having if-statements where the condition is unchanged within the loop, such as
for (int j = 0 ; j < N ; j++) {
    if (debug_on)
        printf("Debug print.\n");
    sum += i;
}
The optimization duplicates the loop, optimizes one copy as if the condition were true and the other as if it were false, and selects the correct version at runtime. That is, the loop above is transformed to
if (debug_on) {
    for (int j = 0 ; j < N ; j++) {
        printf("Debug print.\n");
        sum += i;
    }
} else {
    for (int j = 0 ; j < N ; j++) {
        sum += i;
    }
}
so the performance should be the same with/without WITH_DEBUG defined — both should end up running code resulting from the loop without the if-statement!

I looked into some of the details of what is happening. Compiling without defining WITH_DEBUG makes GCC determine that the inner loop
for (int j = 0 ; j < N ; j++) {
    sum += i;
}
calculates
sum += (unsigned long)i * N;
This, in turn, makes the outer loop calculate the sum of an arithmetic sequence, which the compiler knows how to turn into a constant. The result is that the whole program is transformed to the equivalent of
int main(void)
{
    getenv("DEBUG");
    unsigned long sum = 996882102603448320ul;
    printf("Sum: %lu\n", sum);

    return 0;
}
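The constant can be checked by hand: the total is \(\sum_{i=0}^{N-1} i \cdot N = \frac{N(N-1)}{2} \cdot N = 4999999950000000 \cdot 10^{8}\) for \(N = 10^{8}\), and reducing this modulo \(2^{64}\) (the wrap-around of unsigned long) gives 996882102603448320, the value in the code above.
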
Compiling with WITH_DEBUG defined, GCC determines that the inner loop does not change debug_on, and the loop is "unswitched". The sums are determined to be "i * N" as in the previous case, and the compiler sees that both branches do the same calculation, which can be moved out and combined. The result is that the inner loop is transformed to
if (debug_on) {
    for (int j = 0 ; j < N ; j++)
        printf("Debug print.\n");
}
sum += (unsigned long)i * N;
The outer loop could now be unswitched, but that does not happen, so the compiler continues by noticing that the sum can be directly calculated as in the previous case, and the resulting optimized program is equivalent to
int main(void)
{
    int debug_on = !!getenv("DEBUG");
    for (int i = 0 ; i < N ; i++)
        if (debug_on)
            for (int j = 0 ; j < N ; j++)
                printf("Debug print.\n");

    unsigned long sum = 996882102603448320ul;
    printf("Sum: %lu\n", sum);

    return 0;
}

The reason the program is not fully optimized is that the optimization passes are not run first on the inner loop as in my description above — they are (mostly) run over the whole program, one optimization pass at a time, and each pass requires the code to be simple enough for its optimizations to trigger. In this case, the outer loop would have been unswitched if sum had been optimized before the unswitching pass was run on the outer loop. Dealing with this is one of the "fun" parts of compiler development — you implement an amazing optimization pass that can do wonders, but the stupid real-world code needs to be simplified by other passes (later in the chain) before your pass can do its magic. It is, of course, easy to fix this by adding additional optimization passes before (and in the middle of) your new pass, but the users will complain that the compiler is too slow and switch to a faster, less optimizing, compiler... So one important part of compiler development is to ensure that the pass order is good enough to handle most reasonable cases with a limited number of optimization passes.

This also means that simple micro-benchmarking may not give a true and fair view of how code is optimized in reality — I have seen many cases where micro-benchmarks in the compiler's test suite are optimized as expected, while the optimization is almost guaranteed not to trigger for real world code due to pass ordering issues. So your micro-benchmark may show that a code construct X is faster than code construct Y (or that compiler Z is better than compiler W), but the behavior in more complex real-world usage may be very different...

Thursday, November 24, 2016

Tentative variable definitions, and -fno-common

A comment in a Reddit thread on my previous blog post claims that the code is optimized if the variables are declared in the same translation unit as the usage. That is, the claim is that
int a;
int b;

int foo(void)
{
    return &a != &b;
}
compiles to code returning a constant value. This is true for C++
_Z3foov:
    movl    $1, %eax
    ret
but compiling as C still generates
foo:
    movl    $a, %eax
    cmpq    $b, %rax
    setne   %al
    movzbl  %al, %eax
    ret
The reason is that an external declaration without an initializer is what the C standard calls a tentative definition. A tentative definition works in essentially the same way as if the variable were declared extern (but with the difference that the linker creates the variable if it is not defined in any translation unit), so GCC must assume that the linker may assign the same address to a and b in the same way as for normal extern-declared variables. It is possible to make the variables "real" definitions by initializing them with a value, so changing the declarations to
int a = 0;
int b = 0;
enables the compiler to optimize the function in C too.

One other way to make GCC optimize this is to specify the command-line option -fno-common that disables generation of "common symbols" which are used for emitting tentative definitions. This has the benefit of improving code generation for some CPU architectures, such as ARM. As an example, consider the function
int a;
int b;

int foo(void)
{
    return a + b;
}
The ARM architecture needs the variables' addresses placed in registers in order to access the values, so the compiler will generate code similar to
foo:
    ldr     r2, .L2         /* Load address of a */
    ldr     r3, .L2+4       /* Load address of b */
    ldr     r0, [r2]        /* Load a */
    ldr     r3, [r3]        /* Load b */
    add     r0, r0, r3
    bx      lr
.L2:
    .word   a
    .word   b
The compiler could do a better job if it knew how the variables are laid out in memory, as it then only needs to load one address and use relative addressing for the other. But that is not possible when a and b are tentative definitions, as the linker may place them wherever it chooses. Using -fno-common will, however, place a and b in the data (or BSS) segment exactly as they are laid out by the compiler, and the code can be optimized based on knowledge that b is placed right after a in memory1
foo:
    ldr     r2, .L2         /* Load address of a */
    ldr     r0, [r2]        /* Load a */
    ldr     r3, [r2, #4]    /* Load b */
    add     r0, r0, r3
    bx      lr
.L2:
    .word   a


1. A real ARM compiler would use an ldmia instruction in this case, which is even better. But I chose to use this naive code generation in the example in order to show the general principle, without discussing too many ARM-specific details.