It is easy to use quick-bench to benchmark small C++ functions, but it is hard to ensure they are measuring what is intended. Take this benchmark as an example. It is measuring different ways of transforming a string to lower case using code of the form
void by_index(char* s, size_t n) { for (size_t i = 0; i < n; ++i) s[i] = tolower(s[i]); } static void LowerByIndex(benchmark::State& state) { // Code inside this loop is measured repeatedly for (auto _ : state) { char upper_string[] = "UPPER TEXT"; by_index(upper_string, 10); // Make sure the variable is not optimized away by compiler benchmark::DoNotOptimize(upper_string); } } // Register the function as a benchmark BENCHMARK(LowerByIndex);
We are going to look at how compilers may optimize it in ways that were probably not intended.
The
How a compiler may optimize the code
The following corresponds to how GCC optimizes this on Linux when using the libc++ header files.1The
tolower
function is defined in /user/include/ctype.h
asextern const __int32_t **__ctype_tolower_loc (void) throw () __attribute__ ((__const__)); extern __inline __attribute__ ((__gnu_inline__)) int tolower (int __c) throw () { return __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c; }This is inlined into
by_index
, which in turn is inlined into LowerByIndex
static void LowerByIndex(benchmark::State& state) { // Code inside this loop is measured repeatedly for (auto _ : state) { char upper_string[] = "UPPER TEXT"; for (size_t i = 0; i < 10; ++i) { int __c = upper_string[i]); upper_string[i] = __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c; } benchmark::DoNotOptimize(upper_string); } }
A
char
has values in the range -128 to 127 when compiling for the x86 architecture, so the compiler determines that the comparisons inupper_string[i] = __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
are always true (as
__c
is assigned from a char
), and the line is simplified toupper_string[i] = (*__ctype_tolower_loc())[__c];
The
__ctype_tolower_loc
function is decorated with the const
attribute so the function call can be moved out of the loop, provided the loop body is always executed. Compilers typically represent for-loops as an if-statement followed by a do-while loop – for examplefor (i = 0; i < n; ++i) { do_something(); }is represented as
i = 0; if (i < n) { do { do_something(); ++i; } while (i < n); }
This representation simplifies the work for the compiler as the pre-condition is separated from the actual loop, and constant expressions can now trivially be moved out of the loop. The if-statement is simplified to always true (or always false) when the iteration count is known at compile time. Rewriting the loops, and moving
__ctype_tolower_loc
out of the loops gives us the resultstatic void LowerByIndex(benchmark::State& state) { auto it = state.begin() if (it != state.end()) { int *p = *__ctype_tolower_loc(); // Code inside this loop is measured repeatedly do { char upper_string[] = "UPPER TEXT"; size_t i = 0; do { int __c = upper_string[i]); upper_string[i] = p[__c]; ++i; } while (i < 10); benchmark::DoNotOptimize(upper_string); ++it; } while (it != state.end()); } }
Note that the call to
__ctype_tolower_loc
is now outside of the code segment being measured!
The inner loop is small, so it is fully unrolled
static void LowerByIndex(benchmark::State&amp; state) { auto it = state.begin() if (it != state.end()) { int *p = *__ctype_tolower_loc(); // Code inside this loop is measured repeatedly do { char upper_string[] = "UPPER TEXT"; int __c0 = upper_string[0]); upper_string[0] = p[__c0]; int __c1 = upper_string[1]); upper_string[1] = p[__c1]; int __c2 = upper_string[2]); upper_string[2] = p[__c2]; int __c3 = upper_string[3]); upper_string[3] = p[__c3]; int __c4 = upper_string[4]); upper_string[4] = p[__c4]; int __c5 = upper_string[5]); upper_string[5] = p[__c5]; int __c6 = upper_string[6]); upper_string[6] = p[__c6]; int __c7 = upper_string[7]); upper_string[7] = p[__c7]; int __c8 = upper_string[8]); upper_string[8] = p[__c8]; int __c9 = upper_string[9]); upper_string[9] = p[__c9]; benchmark::DoNotOptimize(upper_string); ++it; } while (it != state.end()); } }
All accesses to
upper_string
are now to known positions, so the compiler can easily forward the values written to where they are read, and the generated code does not need to initialize or read from upper_string
static void LowerByIndex(benchmark::State&amp; state) { auto it = state.begin() if (it != state.end()) { int *p = *__ctype_tolower_loc(); // Code inside this loop is measured repeatedly do { char upper_string[10]; upper_string[0] = p['U']; upper_string[1] = p['P']; upper_string[2] = p['P']; upper_string[3] = p['E']; upper_string[4] = p['R']; upper_string[5] = p[' ']; upper_string[6] = p['T']; upper_string[7] = p['E']; upper_string[8] = p['X']; upper_string[9] = p['T']; benchmark::DoNotOptimize(upper_string); ++it; } while (it != state.end()); } }
Finally, several characters are the same, so the compiler can CSE (Common Subexpression Elimination) them to only load these values once from
p
static void LowerByIndex(benchmark::State&amp; state) { auto it = state.begin() if (it != state.end()) { int *p = *__ctype_tolower_loc(); // Code inside this loop is measured repeatedly do { char upper_string[10]; upper_string[0] = p['U']; upper_string[1] = upper_string[2] = p['P']; upper_string[3] = upper_string[7] = p['E']; upper_string[4] = p['R']; upper_string[5] = p[' ']; upper_string[6] = upper_string[9] = p['T']; upper_string[8] = p['X']; benchmark::DoNotOptimize(upper_string); ++it; } while (it != state.end()); } }
That is, the compiler has used its knowledge of the input to basically hard code the result (this may be OK, depending on what the benchmark is trying to measure). And it has moved some code out of the code segment being measured (which is probably not OK for the benchmark).
How to fix the benchmark
I usually prefer keeping the benchmarked function in a separate translation unit in order to guarantee that the compiler cannot take advantage of the code setting up the benchmark, but that does not work in quick-bench. One way to get a similar effect is to mark the function as
noinline
, but that only solves part of the problem – compilers do various interprocedural optimizations, and for GCC you should specify at least noclone
too. Other compilers may need to be restricted in different ways.
It may also be possible to hide information from the compiler by using
volatile
or functionality from the benchmarking framework (such as benchmark::DoNotOptimize
and benchmark::ClobberMemory
), but this may also introduce unintended behavior. For example, these workarounds make the code look “unusual” to the compiler, which may make various heuristics make different optimization decisions compare to normal usage.
In general, you need to spend some time analyzing the benchmark in order to determine what the result means (for example, are we measuring the difference in how fast different methods can transform a string, or are we only measuring the difference for the string “UPPER TEXT”?), or as Fabian Giesen says in “A whirlwind introduction to dataflow graphs”
With microbenchmarks, like a trial lawyer during cross-examination, you should never ask a question you don’t know the answer to (or at least have a pretty good idea of what it is). Real-world systems are generally too complex and intertwined to understand from surface measurements alone. If you have no idea how a system works at all, you don’t know what the right questions are, nor how to ask them, and any answers you get will be opaque at best, if not outright garbage. Microbenchmarks are a useful tool to confirm that an existing model is a good approximation to reality, but not very helpful in building these models to begin with.
1. The libstdc++ headers use a normal function call for
tolower
instead of using the inlined version from ctype.h
. You can see the optimization from this blog post using libstdc++ too by including ctype.h
before any C++ header (but this is not possible in quick-bench, as it adds its own headers before the user code).