Saturday, July 28, 2018

Don’t trust quick-bench results you see on the internet

It is easy to use quick-bench to benchmark small C++ functions, but it is hard to ensure they are measuring what is intended. Take this benchmark as an example. It is measuring different ways of transforming a string to lower case using code of the form
void by_index(char* s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        s[i] = tolower(s[i]);
}

static void LowerByIndex(benchmark::State& state)
{
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        char upper_string[] = "UPPER TEXT";
        by_index(upper_string, 10);

        // Make sure the variable is not optimized away by compiler
        benchmark::DoNotOptimize(upper_string);
    }
}

// Register the function as a benchmark
BENCHMARK(LowerByIndex);
We are going to look at how compilers may optimize it in ways that were probably not intended.

How a compiler may optimize the code

The following corresponds to how GCC optimizes this on Linux when using the libc++ header files.1

The tolower function is defined in /user/include/ctype.h as
extern const __int32_t **__ctype_tolower_loc (void)
    throw () __attribute__ ((__const__));

extern __inline __attribute__ ((__gnu_inline__)) int
tolower (int __c) throw ()
{
    return __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
}
This is inlined into by_index, which in turn is inlined into LowerByIndex
static void LowerByIndex(benchmark::State&amp; state)
{
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        char upper_string[] = "UPPER TEXT";
        for (size_t i = 0; i < 10; ++i) {
            int __c = upper_string[i]);
            upper_string[i] = __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
        }

        benchmark::DoNotOptimize(upper_string);
    }
}

A char has values in the range -128 to 127 when compiling for the x86 architecture, so the compiler determines that the comparisons in
upper_string[i] = __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
are always true (as __c is assigned from a char), and the line is simplified to
upper_string[i] = (*__ctype_tolower_loc())[__c];

The __ctype_tolower_loc function is decorated with the const attribute so the function call can be moved out of the loop, provided the loop body is always executed. Compilers typically represent for-loops as an if-statement followed by a do-while loop – for example
for (i = 0; i < n; ++i) {
    do_something();
}
is represented as
i = 0;
if (i < n) {
    do {
        do_something();
        ++i;
    } while (i < n);
}
This representation simplifies the work for the compiler as the pre-condition is separated from the actual loop, and constant expressions can now trivially be moved out of the loop. The if-statement is simplified to always true (or always false) when the iteration count is known at compile time. Rewriting the loops, and moving __ctype_tolower_loc out of the loops gives us the result
static void LowerByIndex(benchmark::State&amp; state)
{
    auto it = state.begin()
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[] = "UPPER TEXT";
            size_t i = 0;
            do {
                int __c = upper_string[i]);
                upper_string[i] = p[__c];
                ++i;
            } while (i < 10);

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
Note that the call to __ctype_tolower_loc is now outside of the code segment being measured!

The inner loop is small, so it is fully unrolled
static void LowerByIndex(benchmark::State&amp;amp; state) {
    auto it = state.begin()
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[] = "UPPER TEXT";
            int __c0 = upper_string[0]);
            upper_string[0] = p[__c0];
            int __c1 = upper_string[1]);
            upper_string[1] = p[__c1];
            int __c2 = upper_string[2]);
            upper_string[2] = p[__c2];
            int __c3 = upper_string[3]);
            upper_string[3] = p[__c3];
            int __c4 = upper_string[4]);
            upper_string[4] = p[__c4];
            int __c5 = upper_string[5]);
            upper_string[5] = p[__c5];
            int __c6 = upper_string[6]);
            upper_string[6] = p[__c6];
            int __c7 = upper_string[7]);
            upper_string[7] = p[__c7];
            int __c8 = upper_string[8]);
            upper_string[8] = p[__c8];
            int __c9 = upper_string[9]);
            upper_string[9] = p[__c9];

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
All accesses to upper_string are now to known positions, so the compiler can easily forward the values written to where they are read, and the generated code does not need to initialize or read from upper_string
static void LowerByIndex(benchmark::State&amp;amp; state) {
    auto it = state.begin()
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[10];
            upper_string[0] = p['U'];
            upper_string[1] = p['P'];
            upper_string[2] = p['P'];
            upper_string[3] = p['E'];
            upper_string[4] = p['R'];
            upper_string[5] = p[' '];
            upper_string[6] = p['T'];
            upper_string[7] = p['E'];
            upper_string[8] = p['X'];
            upper_string[9] = p['T'];

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
Finally, several characters are the same, so the compiler can CSE (Common Subexpression Elimination) them to only load these values once from p
static void LowerByIndex(benchmark::State&amp;amp; state)
{
    auto it = state.begin()
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[10];
            upper_string[0] = p['U'];
            upper_string[1] = upper_string[2] = p['P'];
            upper_string[3] = upper_string[7] = p['E'];
            upper_string[4] = p['R'];
            upper_string[5] = p[' '];
            upper_string[6] = upper_string[9] = p['T'];
            upper_string[8] = p['X'];

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
That is, the compiler has used its knowledge of the input to basically hard code the result (this may be OK, depending on what the benchmark is trying to measure). And it has moved some code out of the code segment being measured (which is probably not OK for the benchmark).

How to fix the benchmark

I usually prefer keeping the benchmarked function in a separate translation unit in order to guarantee that the compiler cannot take advantage of the code setting up the benchmark, but that does not work in quick-bench. One way to get a similar effect is to mark the function as noinline, but that only solves part of the problem – compilers do various interprocedural optimizations, and for GCC you should specify at least noclone too. Other compilers may need to be restricted in different ways.

It may also be possible to hide information from the compiler by using volatile or functionality from the benchmarking framework (such as benchmark::DoNotOptimize and benchmark::ClobberMemory), but this may also introduce unintended behavior. For example, these workarounds make the code look “unusual” to the compiler, which may make various heuristics make different optimization decisions compare to normal usage.

In general, you need to spend some time analyzing the benchmark in order to determine what the result means (for example, are we measuring the difference in how fast different methods can transform a string, or are we only measuring the difference for the string “UPPER TEXT”?), or as Fabian Giesen says in “A whirlwind introduction to dataflow graphs
With microbenchmarks, like a trial lawyer during cross-examination, you should never ask a question you don’t know the answer to (or at least have a pretty good idea of what it is). Real-world systems are generally too complex and intertwined to understand from surface measurements alone. If you have no idea how a system works at all, you don’t know what the right questions are, nor how to ask them, and any answers you get will be opaque at best, if not outright garbage. Microbenchmarks are a useful tool to confirm that an existing model is a good approximation to reality, but not very helpful in building these models to begin with.


1. The libstdc++ headers use a normal function call for tolower instead of using the inlined version from ctype.h. You can see the optimization from this blog post using libstdc++ too by including ctype.h before any C++ header (but this is not possible in quick-bench, as it adds its own headers before the user code).