Saturday, July 28, 2018

Don’t trust quick-bench results you see on the internet

It is easy to use quick-bench to benchmark small C++ functions, but it is hard to ensure that the benchmark measures what was intended. Take this benchmark as an example. It measures different ways of transforming a string to lower case, using code of the form
void by_index(char* s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        s[i] = tolower(s[i]);
}

static void LowerByIndex(benchmark::State& state)
{
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        char upper_string[] = "UPPER TEXT";
        by_index(upper_string, 10);

        // Make sure the variable is not optimized away by compiler
        benchmark::DoNotOptimize(upper_string);
    }
}

// Register the function as a benchmark
BENCHMARK(LowerByIndex);
We are going to look at how compilers may optimize it in ways that were probably not intended.

How a compiler may optimize the code

The following corresponds to how GCC optimizes this on Linux when using the libc++ header files.[1]

The tolower function is defined in /usr/include/ctype.h as
extern const __int32_t **__ctype_tolower_loc (void)
    throw () __attribute__ ((__const__));

extern __inline __attribute__ ((__gnu_inline__)) int
tolower (int __c) throw ()
{
    return __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
}
This is inlined into by_index, which in turn is inlined into LowerByIndex
static void LowerByIndex(benchmark::State& state)
{
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        char upper_string[] = "UPPER TEXT";
        for (size_t i = 0; i < 10; ++i) {
            int __c = upper_string[i];
            upper_string[i] = __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
        }

        benchmark::DoNotOptimize(upper_string);
    }
}

A char is signed, with values in the range -128 to 127, when compiling for the x86 architecture, so the compiler determines that the comparisons in
upper_string[i] = __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
are always true (as __c is assigned from a char), and the line is simplified to
upper_string[i] = (*__ctype_tolower_loc())[__c];

The __ctype_tolower_loc function is decorated with the const attribute so the function call can be moved out of the loop, provided the loop body is always executed. Compilers typically represent for-loops as an if-statement followed by a do-while loop – for example
for (i = 0; i < n; ++i) {
    do_something();
}
is represented as
i = 0;
if (i < n) {
    do {
        do_something();
        ++i;
    } while (i < n);
}
This representation simplifies the work for the compiler as the pre-condition is separated from the actual loop, and constant expressions can now trivially be moved out of the loop. The if-statement is simplified to always true (or always false) when the iteration count is known at compile time. Rewriting the loops, and moving __ctype_tolower_loc out of the loops gives us the result
static void LowerByIndex(benchmark::State& state)
{
    auto it = state.begin();
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[] = "UPPER TEXT";
            size_t i = 0;
            do {
                int __c = upper_string[i];
                upper_string[i] = p[__c];
                ++i;
            } while (i < 10);

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
Note that the call to __ctype_tolower_loc is now outside of the code segment being measured!
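The hoisting enabled by the const attribute can be seen in a small standalone sketch (table() here is a made-up function for illustration, not glibc's __ctype_tolower_loc):

```cpp
#include <cassert>

// Hypothetical table() -- not glibc's. The const attribute promises GCC
// that the result depends only on the arguments (here: nothing), so the
// call may be hoisted out of any loop whose body is known to execute.
__attribute__((const))
static const int* table()
{
    static const int t[3] = {10, 20, 30};
    return t;
}

static int sum(int n)
{
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += table()[i % 3];  // table() can be evaluated once, before the loop
    return s;
}
```

Compiling this with -O2 and -fdump-tree-all shows the call being moved out of the loop, just as happens with __ctype_tolower_loc above.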

The inner loop is small, so it is fully unrolled
static void LowerByIndex(benchmark::State& state) {
    auto it = state.begin();
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[] = "UPPER TEXT";
            int __c0 = upper_string[0];
            upper_string[0] = p[__c0];
            int __c1 = upper_string[1];
            upper_string[1] = p[__c1];
            int __c2 = upper_string[2];
            upper_string[2] = p[__c2];
            int __c3 = upper_string[3];
            upper_string[3] = p[__c3];
            int __c4 = upper_string[4];
            upper_string[4] = p[__c4];
            int __c5 = upper_string[5];
            upper_string[5] = p[__c5];
            int __c6 = upper_string[6];
            upper_string[6] = p[__c6];
            int __c7 = upper_string[7];
            upper_string[7] = p[__c7];
            int __c8 = upper_string[8];
            upper_string[8] = p[__c8];
            int __c9 = upper_string[9];
            upper_string[9] = p[__c9];

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
All accesses to upper_string are now to known positions, so the compiler can easily forward the values written to where they are read, and the generated code does not need to initialize or read from upper_string
static void LowerByIndex(benchmark::State& state) {
    auto it = state.begin();
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[10];
            upper_string[0] = p['U'];
            upper_string[1] = p['P'];
            upper_string[2] = p['P'];
            upper_string[3] = p['E'];
            upper_string[4] = p['R'];
            upper_string[5] = p[' '];
            upper_string[6] = p['T'];
            upper_string[7] = p['E'];
            upper_string[8] = p['X'];
            upper_string[9] = p['T'];

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
Finally, several characters are the same, so the compiler can apply CSE (common subexpression elimination) and load these values only once from p
static void LowerByIndex(benchmark::State& state)
{
    auto it = state.begin();
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[10];
            upper_string[0] = p['U'];
            upper_string[1] = upper_string[2] = p['P'];
            upper_string[3] = upper_string[7] = p['E'];
            upper_string[4] = p['R'];
            upper_string[5] = p[' '];
            upper_string[6] = upper_string[9] = p['T'];
            upper_string[8] = p['X'];

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
That is, the compiler has used its knowledge of the input to basically hard code the result (this may be OK, depending on what the benchmark is trying to measure). And it has moved some code out of the code segment being measured (which is probably not OK for the benchmark).

How to fix the benchmark

I usually prefer keeping the benchmarked function in a separate translation unit in order to guarantee that the compiler cannot take advantage of the code setting up the benchmark, but that does not work in quick-bench. One way to get a similar effect is to mark the function as noinline, but that only solves part of the problem – compilers do various interprocedural optimizations, and for GCC you should specify at least noclone too. Other compilers may need to be restricted in different ways.
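A sketch of what that looks like for GCC (the attribute spelling is GCC-specific; other compilers need their own annotations, and Clang simply warns about the unknown noclone attribute):

```cpp
#include <cctype>
#include <cstddef>
#include <cassert>
#include <cstring>

// GCC-specific: prevent both inlining and interprocedural cloning, so the
// compiler cannot specialize by_index for the constant benchmark input.
__attribute__((noinline, noclone))
void by_index(char* s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        s[i] = static_cast<char>(tolower(static_cast<unsigned char>(s[i])));
}
```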

It may also be possible to hide information from the compiler by using volatile or functionality from the benchmarking framework (such as benchmark::DoNotOptimize and benchmark::ClobberMemory), but this may also introduce unintended behavior. For example, these workarounds make the code look “unusual” to the compiler, which may cause its heuristics to make different optimization decisions compared to normal usage.
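As one example of hiding the input with volatile, the buffer address can be laundered through a volatile pointer (a sketch; the hidden pointer is our addition, not part of the original benchmark):

```cpp
#include <cctype>
#include <cstddef>
#include <cassert>
#include <cstring>

static void by_index(char* s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        s[i] = static_cast<char>(tolower(static_cast<unsigned char>(s[i])));
}

static char upper_string[] = "UPPER TEXT";

// Reading the address through a volatile pointer makes the buffer's
// contents unknown to the optimizer, so the lower-cased result cannot
// be hard coded at compile time.
static char* volatile hidden = upper_string;

static void run_once()
{
    char* s = hidden;  // volatile read: compiler cannot assume what s points to
    by_index(s, 10);
}
```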

In general, you need to spend some time analyzing the benchmark in order to determine what the result means (for example, are we measuring the difference in how fast different methods can transform a string, or are we only measuring the difference for the string “UPPER TEXT”?), or as Fabian Giesen says in “A whirlwind introduction to dataflow graphs”:
With microbenchmarks, like a trial lawyer during cross-examination, you should never ask a question you don’t know the answer to (or at least have a pretty good idea of what it is). Real-world systems are generally too complex and intertwined to understand from surface measurements alone. If you have no idea how a system works at all, you don’t know what the right questions are, nor how to ask them, and any answers you get will be opaque at best, if not outright garbage. Microbenchmarks are a useful tool to confirm that an existing model is a good approximation to reality, but not very helpful in building these models to begin with.


[1] The libstdc++ headers use a normal function call for tolower instead of using the inlined version from ctype.h. You can see the optimization from this blog post using libstdc++ too by including ctype.h before any C++ header (but this is not possible in quick-bench, as it adds its own headers before the user code).

3 comments:

  1. The article plants a few opportunities for posing extra-credit questions, and I won't resist posting them (alternatively: exercise is essential to learning ;)

    1. Look again at return type of __ctype_tolower_loc:
    extern const __int32_t **__ctype_tolower_loc (void)

    Note how it returns not the location of the translation table, but a pointer to memory holding a pointer to the table. Why the extra indirection?

    2. In the final snippet, may the compiler move memory references such as p['U'] out of the do-while loop? And if yes, may it move assignments to upper_string[N] out of the loop as well?

    3. Note that some optimizations were possible only because the string contents ("UPPER TEXT") were visible to the compiler. How can the source be changed to thwart those optimizations (but still allow the rest, like inlining, loop-invariant code motion, etc.)?

    4. Let's say the benchmark is fixed. Note how it's structured to measure throughput of 'tolower' operations. What if we want to estimate their latency instead? Is one metric more important than the other in practice?

  2. There are some spots in the article that catch my eye as being inaccurate or unclear in describing how GCC works internally. I'll list them here as the author kindly invited me to share them, but to be clear, they don't invalidate the article :)

    First of all, as the article talks about sequence of transformations performed by the compiler, it would be nice to point out compiler -fdump-tree-all functionality so readers get not just the fish, but also the fishing rod.

    > Compilers typically represent for-loops as an if-statement followed by a do-while loop

    Compiler representation is just basic blocks and gotos, mostly. A separate pass called "loop header copying" transforms the control flow graph to have if-do-while structure. I'd say, "compilers typically transform loops to have if-do-while structure (to make life easier for optimizations, as the following text illustrates)".

    > Rewriting the loops, and moving __ctype_tolower_loc out of the loops gives us the result

    The code that follows moves not only the function call, but dereference of memory it points to (int *p = *__ctype_tolower_loc();), without explaining why it's safe to do so. Cheater! ;)

    > and for GCC you should specify at least noclone too

    Well, GCC itself used that approach and got burned: the first testcases used plain noinline, then people started noticing that noinline is not enough, and new testcases tended to use 'noinline,noclone' until... you can probably guess what happened: someone noticed that optimizations had advanced yet again and the noinline+noclone combo was not enough. For use in its testsuite, GCC came up with the 'noipa' attribute, which is specced in a forward-looking manner, but portable benchmarks need to properly model properties like which inputs are opaque and which outputs should not be optimized out.

  3. My preferred approach for ensuring that function calls don't make any assumptions about their target is to call the function via a `volatile`-qualified pointer. On many platforms this will add some overhead, but unless the function being benchmarked is relatively trivial, the overhead (including any effect on the calling code's optimization heuristics) should be small relative to the cost of the function.
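    A sketch of that approach (by_index stands in for the function under test):

    ```cpp
    #include <cctype>
    #include <cstddef>
    #include <cassert>
    #include <cstring>

    static void by_index(char* s, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            s[i] = static_cast<char>(tolower(static_cast<unsigned char>(s[i])));
    }

    // The pointer is volatile-qualified, so the compiler must reload it
    // for each call and cannot inline or specialize the target.
    static void (* volatile benchmarked)(char*, size_t) = by_index;

    static void run_once(char* s, size_t n)
    {
        benchmarked(s, n);  // indirect call through the volatile pointer
    }
    ```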

    I'm not sure why there isn't a less costly standard way of telling a compiler "treat this as a call to an outside function whose code you can't see but which may have arbitrary side-effects". Calling a "volatile" function pointer doesn't technically achieve that(*), but there's no reason any sane compiler should have any difficulty achieving the proper semantics without the cost of actually fetching the function pointer.

    (*) If code makes 100 calls through such a pointer within a loop, and the compiler thinks it's likely to point to a function that doesn't access any "volatile" objects, it could read the pointer up to 100 times, stopping if it ever points to anything other than the expected function, and if all 100 reads yield the same thing, then proceed to use an optimized version of the loop. Such an optimization might improve performance if the compiler happens to guess right about the function pointer, while costing relatively little if it guesses wrong. On the other hand, situations where that would actually improve performance are probably less common in most fields than situations where code uses such volatile pointers in an effort to achieve "outside-function-call" semantics.
