Wednesday, November 30, 2016

Loop optimization, and microbenchmarking

I read a blog post, "Reducing the Performance Impact of Debug Code in C Libraries", that describes a way to handle the overhead of debug code (compile two versions, with and without the debug code enabled, and load the right version dynamically). The blog post's approach is a reasonable way to solve the problem, but the example illustrating the overhead of debug code surprised me:
#include <stdio.h>
#include <stdlib.h>

#define N 100000000 // Adjust for your machine

int main(void)
{
    int debug_on = !!getenv("DEBUG");
    unsigned long sum = 0;

    for (int i = 0 ; i < N ; i++) {
        for (int j = 0 ; j < N ; j++) {
#ifdef WITH_DEBUG
            if (debug_on)
                printf("Debug print.\n");
#endif
            sum += i;
        }
    }
    printf("Sum: %lu\n", sum);

    return 0;
}
The blog post claims that the version compiled with WITH_DEBUG defined (introducing the overhead of "if (false)") is 50% slower than the one compiled without it — I had thought that both versions would have the same running time, so I tested the program compiled with "gcc -O3", and the difference was even bigger on my computer...

The reason I thought both versions would have identical performance is that GCC has an optimization -funswitch-loops (enabled at -O3) that optimizes loops having if-statements where the condition is unchanged within the loop, such as
for (int j = 0 ; j < N ; j++) {
    if (debug_on)
        printf("Debug print.\n");
    sum += i;
}
The optimization duplicates the loop, optimizes one as if the condition is true and the other as if it is false, and selects the correct version at runtime. That is, the loop above is transformed to
if (debug_on) {
    for (int j = 0 ; j < N ; j++) {
        printf("Debug print.\n");
        sum += i;
    }
} else {
    for (int j = 0 ; j < N ; j++) {
        sum += i;
    }
}
so the performance should be the same with/without WITH_DEBUG defined — both should end up running code resulting from the loop without the if-statement!

I looked into the details of what is happening. Compiling without defining WITH_DEBUG makes GCC determine that the inner loop
for (int j = 0 ; j < N ; j++) {
    sum += i;
}
calculates
sum += (unsigned long)i * N;
This, in turn, makes the outer loop calculate the sum of an arithmetic sequence, which the compiler knows how to turn into a constant. The result is that the whole program is transformed to the equivalent of
int main(void)
{
    getenv("DEBUG");
    unsigned long sum = 996882102603448320ul;
    printf("Sum: %lu\n", sum);

    return 0;
}
When compiling with WITH_DEBUG defined, GCC determines that the inner loop does not change debug_on, and the loop is "unswitched". The sums are determined to be "i * N" as in the previous case, and the compiler sees that both branches do the same calculation, which can be moved out and combined. The result is that the inner loop is transformed to
if (debug_on) {
    for (int j = 0 ; j < N ; j++)
        printf("Debug print.\n");
}
sum += (unsigned long)i * N;
The outer loop could now be unswitched, but that does not happen. Instead, the compiler notices that the sum can be calculated directly, as in the previous case, and the resulting optimized program is equivalent to
int main(void)
{
    int debug_on = !!getenv("DEBUG");
    for (int i = 0 ; i < N ; i++)
        if (debug_on)
            for (int j = 0 ; j < N ; j++)
                printf("Debug print.\n");

    unsigned long sum = 996882102603448320ul;
    printf("Sum: %lu\n", sum);

    return 0;
}

The reason the program is not fully optimized is that the optimization passes are not run first on the inner loop as in my description above — they are (mostly) run over the whole program, one optimization pass at a time, and each pass requires the code to be simple enough for the optimizations to trigger. In this case, the outer loop would have been unswitched if sum had been optimized before the unswitching pass was run on the outer loop. Dealing with this is one of the "fun" parts of compiler development — you implement an amazing optimization pass that can do wonders, but the stupid real-world code needs to be simplified by other passes (later in the chain) before your pass can do its magic. It is, of course, easy to fix this by adding additional optimization passes before (and in the middle of) your new pass, but then the users will complain that the compiler is too slow and switch to a faster, less optimizing compiler... So one important part of compiler development is to ensure that the pass order is good enough to handle most reasonable cases with a limited number of optimization passes.

This also means that simple micro-benchmarking may not give a true and fair view of how code is optimized in reality — I have seen many cases where micro-benchmarks in the compiler's test suite are optimized as expected, while the optimization is almost guaranteed not to trigger for real world code due to pass ordering issues. So your micro-benchmark may show that a code construct X is faster than code construct Y (or that compiler Z is better than compiler W), but the behavior in more complex real-world usage may be very different...

1 comment:

  1. Thank you! I was scratching my head also over the reported results, as they didn't seem to make sense. Now they do.

    It would be interesting to see results with different versions of gcc, as well as with different compilers, like clang.

    But the big take-away for me is that with benchmarking (and especially micro-benchmarking) it's not enough to produce results, you also need to *understand* why the results are what they are.
