I have started blogging again!
But I am too annoyed at Blogger, so the new posts will be published in my new blog: https://kristerw.github.io/.
deadstores is a Valgrind tool that finds dead stores – stores where the written value is never read, as in

```c
*p = expensive_computation();
```

where it may make sense to reorganize the code to only do the expensive computation when the value is needed.

The tool tracks the stores at byte granularity: a stored byte is counted as dead if it is overwritten (or invalidated by, for example, the memory being freed, etc.) without having been read. It also records how many times the instruction was executed, and how many of those wrote the same value as was there before.

The byte granularity matters. If we, for example, have a structure

```c
struct {
    int a, b, c, d;
} foo;
```

and code that is initializing the structure as

```c
memset(&foo, 0, sizeof(foo));
```

then the compiler will usually optimize the memset to one SIMD instruction that is clearing all 128 bits. If the program now only reads three of the values, then we will see that 4 of the 16 bytes are dead – instruction granularity would not detect this as the stored value is (partly) used.

Silent stores (stores that write the same value as was already there) are instead counted per instruction. If, for example, sum is a 32-bit int variable containing the value 15, and we add 10 to it

```c
sum += 10;
```

then this would be reported as 3 silent stores if we used byte granularity (as three of the bytes are 0 both before and after the operation).
The deadstores tool is used in the same way as other Valgrind tools

```
valgrind --tool=deadstores [valgrind-options] your-prog [your-prog-options]
```

The program to analyze must be built with debug symbols (-g or better), and optimizations should be enabled in order to produce relevant data. You also need -fno-omit-frame-pointer if you are passing --use-stack-trace=yes (see below) to the tool.
There is a script, script/print_result.py, that extracts the top \(n\) instructions from each of “dead stores”, “silent stores”, and “silent loads”. The output is written to a file named deadstores.out.<pid> (where <pid> is the program’s process ID), but this can be changed by --deadstores-out-file=<filename>. The %p, %q, and %n format specifiers can be used to embed the process ID, the contents of an environment variable, and/or a sequence number in the name, in the same way as for the core option --log-file.
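For example, the following invocation (just an illustration combining the options described in this post) gives each traced process its own report file:

```
valgrind --tool=deadstores --trace-children=yes \
         --deadstores-out-file=deadstores.out.%p your-prog
```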
An entry in the output file looks like

```
0x054b0a25: bytes_written: 11916560  bytes_read: 248126  bytes_dead: 11668434  nof_stores: 744785  nof_silent: 19326
   at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
```

The dead stores here come from the memset implementation in libc, and the values are the sums of all calls to memset in the program. This is not that helpful if we want to find dead stores we can eliminate, but the tool can instead track the loads/stores by the stack trace, which splits the reported data for an instruction according to where it was called from

```
0x054b0a25: bytes_written: 1707072  bytes_read: 1120  bytes_dead: 1705952  nof_stores: 106692  nof_silent: 2500
   at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
   by 0x00b317aa: poison_pages() (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-page.c:2125)
   by 0x00b3362c: ggc_collect() (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-page.c:2226)
   by 0x00bb82d4: cgraph_node::finalize_function(tree_node*, bool) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cgraphunit.c:494)
   by 0x00a60e43: expand_or_defer_fn(tree_node*) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cp/semantics.c:4486)

0x054b0a25: bytes_written: 521808  bytes_read: 12  bytes_dead: 521796  nof_stores: 32613  nof_silent: 0
   at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
   by 0x00d127fe: ggc_internal_cleared_alloc(unsigned long, void (*)(void*), unsigned long, unsigned long) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-common.c:118)
   by 0x012b796e: make_node(tree_code) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc.h:143)
   by 0x012b7c96: build_tree_list(tree_node*, tree_node*) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/tree.c:3184)
   by 0x009e32d6: cp_parser_parameter_declaration_list(cp_parser*, int) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cp/parser.c:22525)
...
```

This is enabled by --use-stack-trace=yes which in general makes the result much more useful (and the tool much slower…). The program to analyze must be compiled with frame pointer support (-fno-omit-frame-pointer) in order to get a reliable result.

The tool also accepts --trace-children=yes, which is useful when analyzing programs (such as gcc) that start subprocesses using the exec system call.

Valgrind works by translating the machine code to an internal representation (IR) in which the tool inserts its instrumentation. Most instructions are translated in a natural way – for example

```
add %rcx,%rax
```

is translated to an IR instruction adding two values
```
t10 = Add64(t7,t4)
```

But some instructions are translated in a way that causes problems. For example, the bit test instruction

```
bt %rax,%r8
```

is generated as a sequence that is storing 8 bytes at an unused address on the stack and reading back one byte of the result

```
t75 = GET:I64(48)      ; <-- Get stack pointer
t74 = Sub64(t75,0x120:I64)
PUT(48) = t74
STle(t74) = t73        ; <-- Store 8 bytes
t18 = And64(t36,0x3F:I64)
t78 = Sar64(t18,0x3:I8)
t77 = Add64(t74,t78)
t80 = And64(t18,0x7:I64)
t112 = 64to8(t80)
t79 = t112
t15 = LDle:I8(t77)     ; <-- Load 1 byte
t84 = GET:I64(168)
t113 = amd64g_calculate_rflags_all[mcx=0x9]{0x581732b0}(0x13:I64,t61,0x0:I64,t84):I64
t85 = t113
t114 = 8Uto64(t15)
t89 = t114
t88 = Shr64(t89,t79)
t87 = And64(t88,0x1:I64)
t90 = And64(t85,0x8D4:I64)
t86 = Or64(t90,t87)
PUT(144) = 0x0:I64
PUT(160) = 0x0:I64
PUT(152) = t86
PUT(168) = 0x0:I64
t91 = Add64(t74,0x120:I64)
PUT(48) = t91
```

This makes the deadstores tool treat this as an 8-byte store where 7 of those bytes are dead… I have worked around this particular case (by ignoring stores to offset 0 from the stack pointer – this holds the pointer to the stack frame for normal functions, so these memory accesses are not relevant for the tool anyway), but there may be additional cases I have not found yet…

Another limitation is that the tracking of silent loads only handles memory addresses that have been stored to, so the deadstores tool does not detect silent loads in read-only data segments. This could be fixed by tracking valid memory in the same way as done by the memcheck tool.
The data for each instruction is stored in a node structure

```c
struct node {
    handle parent;
    handle left;
    handle right;
    enum color color;
    struct key key;

    // Store
    Long bytes_written;
    Long bytes_read;
    Long nof_stores;
    Long nof_silent_stores;

    // Load
    Long nof_loads;
    Long nof_silent_loads;
};
```

The node structures are stored in an array called nodes, and the nodes are organized as a red-black tree keyed by the address of the instruction (or rather, “generalized address” – it contains the top 5 addresses of the stack trace when --use-stack-trace=yes is used). The tree structure is managed by the parent, left, and right fields – these are conceptually pointers to node structures, but we are using handles (that are just the index into the nodes array) to make them valid even if we need to realloc the nodes array to increase its size.
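A minimal sketch of what handle allocation could look like under that scheme (new_node, nof_nodes, and nodes_capacity are my names, not necessarily the tool's; handle 0 is assumed to be reserved to mean “no instruction”):

```c
static struct node *nodes;
static SizeT nof_nodes = 1;      /* handle 0 is reserved as "none" */
static SizeT nodes_capacity;

static handle new_node(void)
{
    if (nof_nodes >= nodes_capacity) {
        nodes_capacity = nodes_capacity ? 2 * nodes_capacity : 1024;
        nodes = VG_(realloc)("deadstores.nodes", nodes,
                             nodes_capacity * sizeof(nodes[0]));
    }
    /* The returned handle is just an index, so it stays valid even if
       the realloc above moved the array. */
    return (handle)nof_nodes++;
}
```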
Each byte of memory that the tool tracks has an addr_info structure for that address (these are allocated in blocks of 1<<20 addresses, and the blocks are stored in a hash table addr_info_table).
```c
struct addr_info {
    unsigned store_instr_h : 31;
    unsigned is_written : 1;
};
```

The addr_info structure contains 1 bit telling if the byte is valid (that is, if it has been written to, and not been invalidated by the memory being freed etc.), and a handle to the instruction that wrote to this byte. Programs do not use that many load/store instructions, so the handle is stored in 31 bits to make the structure fit in 32 bits. This makes a big difference when tracing memory-hungry programs...
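A sketch of how the per-byte lookup could be implemented given this description (the block layout, the hash function, and all names except addr_info and addr_info_table are my assumptions):

```c
#define BLOCK_BITS 20                    /* blocks of 1<<20 addresses */
#define BLOCK_SIZE ((SizeT)1 << BLOCK_BITS)
#define N_BUCKETS  4096

struct addr_block {
    Addr base;                           /* addr >> BLOCK_BITS */
    struct addr_block *next;             /* hash-bucket chain */
    struct addr_info info[BLOCK_SIZE];
};

static struct addr_block *addr_info_table[N_BUCKETS];

static struct addr_info *addr2addr_info(Addr addr)
{
    Addr base = addr >> BLOCK_BITS;
    struct addr_block **bucket = &addr_info_table[base % N_BUCKETS];
    for (struct addr_block *b = *bucket; b != NULL; b = b->next) {
        if (b->base == base)
            return &b->info[addr & (BLOCK_SIZE - 1)];
    }

    /* First access in this block -- allocate zero-initialized state
       (all bytes start out as "not written"). */
    struct addr_block *b = VG_(calloc)("deadstores.addr_block", 1, sizeof *b);
    b->base = base;
    b->next = *bucket;
    *bucket = b;
    return &b->info[addr & (BLOCK_SIZE - 1)];
}
```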
The deadstores tool adds its instrumentation in the function ds_instrument, which iterates through the IR to find the instructions that access memory – the most important are Ist_WrTmp (load) and Ist_Store (store), but there are a few others (such as the atomic compare-and-swap Ist_CAS). Each such instruction is instrumented using the Ist_Dirty IR instruction that is used to call a helper function. The deadstores tool has one such helper function for tracking each of “dead stores”, “silent stores”, and “loads” (each instruction that writes memory gets both of the “dead stores” and “silent stores” helper functions, and the “silent stores” helper function must be called first).

The helper functions take three parameters – the address and size of the memory access, and the address of the instruction. The function that is tracking dead stores updates the counters for the instruction (that the instruction has been executed and has written size bytes) and the status of each written byte (that it has been written, and which instruction wrote the value)

```c
static void dead_store_tracker(Addr addr, SizeT size, Addr iaddr)
{
    handle h = iaddr2handle(iaddr);
    nodes[h].nof_stores += 1;
    nodes[h].bytes_written += size;
    for (SizeT i = 0; i < size; i++) {
        struct addr_info *ai = addr2addr_info(addr + i);
        ai->is_written = True;
        ai->store_instr_h = h;
    }
}
```

The load tracking checks each byte’s store_instr_h to see which instruction wrote this byte. If this is different from 0, then the instruction corresponding to the handle has its bytes_read counter incremented, and the byte’s store_instr_h is set to 0 to mark it as read. If all of the bytes were written but have a zero store_instr_h, then the load is silent, and the load instruction’s counter is updated accordingly

```c
static void load_tracker(Addr addr, SizeT size, Addr iaddr)
{
    SizeT silent_load = 0;
    for (SizeT i = 0; i < size; i++) {
        struct addr_info *ai = addr2addr_info(addr + i);
        if (ai->store_instr_h != 0) {
            nodes[ai->store_instr_h].bytes_read++;
            ai->store_instr_h = 0;
        } else if (ai->is_written) {
            silent_load += 1;
        }
    }

    handle h = iaddr2handle(iaddr);
    nodes[h].nof_loads += 1;
    if (silent_load == size) {
        nodes[h].nof_silent_loads += 1;
    }
}
```

The tracking of silent stores needs a bit more work, as it must compare the value of the original memory with the newly written. This cannot be done in the helper function, so it is done by inserting new IR that does the comparison, and the helper function is not called unless the values are identical. This functionality is partially supported by Valgrind – the Ist_Dirty instruction has an optional guard parameter taking an IR expression determining if the function is to be called, so the only thing needed is to generate the comparison and pass it as the guard. And this comparison is straightforward to emit – if st is an IRStmt pointer to a 32-bit Ist_Store, then the IR for the comparison can be emitted as

```c
IRExpr *data = st->Ist.Store.data;
IRType type = typeOfIRExpr(tyenv, data);
IRExpr *addr = st->Ist.Store.addr;
tl_assert(type == Ity_I32);
IRExpr *load = emit_load(sb, type, addr);
IRExpr *guard = emit_binop(sb, Iop_CmpEQ32, data, load);
```

where the emit_ functions are simple wrappers around the Valgrind API to get rid of some boilerplate code (sb is the IR superblock containing the instructions – this is one of the inputs to ds_instrument).

The silent store helper function then only needs to check that all the bytes have been written (so that we do not count stores to, for example, newly malloced memory that by luck happens to contain the same value that is stored by the instruction), and update the instruction’s silent_stores counter if they have

```c
static void silent_store_tracker(Addr addr, SizeT size, Addr iaddr)
{
    Bool is_written = True;
    for (SizeT i = 0; i < size; i++) {
        struct addr_info *ai = addr2addr_info(addr + i);
        is_written = is_written && ai->is_written;
    }
    if (is_written) {
        handle h = iaddr2handle(iaddr);
        nodes[h].nof_silent_stores += 1;
    }
}
```
Valgrind lets tools register callbacks for various events, such as track_new_mem_stack and track_die_mem_stack that are called each time the stack pointer is decreased/increased, or the needs_malloc_replacement callbacks that are called when the program is calling malloc, free, etc. The deadstores tool is using these in order to mark the memory as invalid (that is, clearing the is_written flag) to prevent counters being incorrectly updated by the silent load/store tracking when, for example, previously freed memory is returned from malloc.
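A minimal sketch of such an invalidation (the helper name is mine; a real implementation would also want to avoid creating tracking state for addresses it has never seen):

```c
static void invalidate_mem(Addr addr, SizeT size)
{
    for (SizeT i = 0; i < size; i++) {
        struct addr_info *ai = addr2addr_info(addr + i);
        ai->is_written = False;    /* the byte no longer holds valid data */
        ai->store_instr_h = 0;     /* forget which instruction wrote it */
    }
}
```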
Most of the tool’s execution time is spent in the two functions iaddr2handle and addr2addr_info. It is easy to improve how this is done.

Running the tool on real programs points out many suspicious stores, such as data being copied with memcpy where the program could have used the original data.

One impressive loop optimization is the transformation of a loop calculating a sum into a closed-form expression – for example, LLVM optimizes

```c
int sum(int count)
{
    int result = 0;
    for (int j = 0; j < count; ++j)
        result += j*j;
    return result;
}
```

to code that calculates the result without a loop (godbolt)
```asm
sum(int):
        test    edi, edi
        jle     .LBB0_1
        lea     eax, [rdi - 1]
        lea     ecx, [rdi - 2]
        imul    rcx, rax
        lea     eax, [rdi - 3]
        imul    rax, rcx
        shr     rax
        imul    eax, eax, 1431655766
        add     eax, edi
        shr     rcx
        lea     ecx, [rcx + 2*rcx]
        lea     eax, [rax + rcx]
        add     eax, -1
        ret
.LBB0_1:
        xor     eax, eax
        ret
```

It handles more complex cases too (godbolt) – that is, the optimization is not just a simple pattern matcher. This post will show how the optimization is done.
The optimization works by expressing the values computed in the loop as “chains of recurrences” (CRs) – a compact description of how a value changes between loop iterations.

Example 1. Consider the loop

```c
void foo(int m, int *p)
{
    for (int j = 0; j < m; j++)
        *p++ = j;
}
```

The loop writes \(0\) to *p++ in the first iteration, \(1\) in the second, etc. So we can express the value written at iteration \(i\) as

\[f_1(i) = i = \{0,+,1\}\]

where \(\{0,+,1\}\) is the chain of recurrences saying that the value starts as \(0\) and is incremented by \(1\) each iteration.

Example 2. Polynomials in the iteration number can also be expressed as chains of recurrences

```c
void foo(int m, int k, int *p)
{
    for (int j = 0; j < m; j++)
        *p++ = j*j*j - 2*j*j + k*j + 7;
}
```

We will see below how to build the functions, but the result for the value stored in this loop is

\[f_2(i) = i^3 - 2i^2 + ki + 7 = \{7,+,k-1,+,2,+,6\}\]

which means that the values can be calculated using only additions
```c
void foo(int m, int k, int *p)
{
    int t0 = 7;
    int t1 = k - 1;
    int t2 = 2;
    for (int j = 0; j < m; j++) {
        *p++ = t0;
        t0 = t0 + t1;
        t1 = t1 + t2;
        t2 = t2 + 6;
    }
}
```

which is a useful optimization for architectures where multiplication is expensive. This kind of code is, however, uncommon, so most compilers do not do this optimization – but they usually do it for simpler cases, such as

```c
void foo(int m, int k, int *p)
{
    for (int j = 0; j < m; j++)
        *p++ = k*j + 7;
}
```

as constructs of the form k*j+7 are common in address calculations.
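For that simpler case, the rewrite needs just one addition per iteration – conceptually (my illustration of the transformation, not compiler output):

```c
void foo(int m, int k, int *p)
{
    int t = 7;                 /* the CR {7,+,k} for k*j + 7 */
    for (int j = 0; j < m; j++) {
        *p++ = t;
        t = t + k;
    }
}
```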
Going back to the sum function

```c
for (int j = 0; j < count; ++j)
    result += j*j;
```

we start with j, which we know has the CR \(\{0,+,1\}\) per Example 1. This is then used as j*j when calculating result, so we calculate the CR for j*j using the simplification rules as

\[j*j = \{0,+,1\} * \{0,+,1\} = \{0,+,1,+,2\}\]

Adding this to result gives us the CR \(\{0,+,0,+,1,+,2\}\) for the value at the beginning of the iteration, and \(\{0,+,1,+,3,+,2\}\) after adding j*j.
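A quick way to convince ourselves that the chain is right is to step it next to the loop – a small sanity-check program of mine (not from the original post):

```c
#include <assert.h>
#include <stdio.h>

int main(void)
{
    /* The CR {0,+,1,+,3,+,2}: c0 holds the value of result after the
       "result += j*j" in iteration j; stepping the chain adds each
       coefficient to the one before it. */
    long long c0 = 0, c1 = 1, c2 = 3, c3 = 2;
    long long result = 0;

    for (int j = 0; j < 20; j++) {
        result += (long long)j * j;
        assert(result == c0);
        c0 += c1;   /* advance the chain one iteration */
        c1 += c2;
        c2 += c3;
    }
    printf("CR {0,+,1,+,3,+,2} matches the loop\n");
    return 0;
}
```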
The compiler sees the sum function in its rotated (do-while) form

```c
int sum(int count)
{
    int result = 0;
    if (count > 0) {
        int j = 0;
        do {
            result = result + j*j;
            ++j;
        } while (j < count);
    }
    return result;
}
```

or as it looks like in the LLVM IR

```llvm
define i32 @sum(i32) {
  %2 = icmp sgt i32 %0, 0
  br i1 %2, label %3, label %6

; <label>:3:
  br label %8

; <label>:4:
  %5 = phi i32 [ %12, %8 ]
  br label %6

; <label>:6:
  %7 = phi i32 [ 0, %1 ], [ %5, %4 ]
  ret i32 %7

; <label>:8:
  %9 = phi i32 [ %13, %8 ], [ 0, %3 ]    ; {0,+,1}
  %10 = phi i32 [ %12, %8 ], [ 0, %3 ]   ; {0,+,0,+,1,+,2}
  %11 = mul nsw i32 %9, %9               ; {0,+,1,+,2}
  %12 = add nuw nsw i32 %11, %10         ; {0,+,1,+,3,+,2}
  %13 = add nuw nsw i32 %9, 1            ; {1,+,1}
  %14 = icmp slt i32 %13, %0
  br i1 %14, label %8, label %4
}
```
The compiler can see that the function returns 0 if count <= 0, otherwise it returns the value of result at loop iteration count-1. Converting the CR \(\{0,+,1,+,3,+,2\}\) for result into a polynomial gives us

\[f(i) = i + {3i(i-1)\over 2} + {i(i-1)(i-2) \over 3}\]

The compiler now only needs to insert code that calculates this with \(i = \) count-1 after the loop

```c
result = count-1 + 3*(count-1)*(count-2)/2 + (count-1)*(count-2)*(count-3)/3;
```

but it needs to take some care to calculate in the correct precision (as temporary values may not fit in 32-bit integers). Integer division is slow, so it is also doing tricks to replace the divisions with multiplication and shift instructions. The result is the LLVM IR

```llvm
%4 = add i32 %0, -1
%5 = zext i32 %4 to i33
%6 = add i32 %0, -2
%7 = zext i32 %6 to i33
%8 = mul i33 %5, %7
%9 = add i32 %0, -3
%10 = zext i32 %9 to i33
%11 = mul i33 %8, %10
%12 = lshr i33 %11, 1
%13 = trunc i33 %12 to i32
%14 = mul i32 %13, 1431655766
%15 = add i32 %14, %0
%16 = lshr i33 %8, 1
%17 = trunc i33 %16 to i32
%18 = mul i32 %17, 3
%19 = add i32 %15, %18
%20 = add i32 %19, -1
```

Inserting this makes the loop become dead, so it is later removed by dead code elimination, and we eventually end up with the code

```asm
sum(int):
        test    edi, edi
        jle     .LBB0_1
        lea     eax, [rdi - 1]
        lea     ecx, [rdi - 2]
        imul    rcx, rax
        lea     eax, [rdi - 3]
        imul    rax, rcx
        shr     rax
        imul    eax, eax, 1431655766
        add     eax, edi
        shr     rcx
        lea     ecx, [rcx + 2*rcx]
        lea     eax, [rax + rcx]
        add     eax, -1
        ret
.LBB0_1:
        xor     eax, eax
        ret
```
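It is easy to verify that the closed form agrees with the loop – a small test program of mine (it uses long long for the intermediate products, in the same spirit as the 33-bit arithmetic above):

```c
#include <stdio.h>

static int sum_loop(int count)
{
    int result = 0;
    for (int j = 0; j < count; ++j)
        result += j * j;
    return result;
}

static int sum_closed(int count)
{
    if (count <= 0)
        return 0;
    long long i = count - 1;   /* wide type for the intermediate products */
    return (int)(i + 3 * i * (i - 1) / 2 + i * (i - 1) * (i - 2) / 3);
}

int main(void)
{
    for (int c = 0; c < 1000; ++c) {
        if (sum_loop(c) != sum_closed(c)) {
            printf("mismatch at %d\n", c);
            return 1;
        }
    }
    printf("OK\n");
    return 0;
}
```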
This optimization is not always profitable. For example,

```c
int sum(int count)
{
    int result = 0;
    for (int j = 0; j < count; ++j)
        result += j*j*j*j*j*j;
    return result;
}
```

is calculated by three 32-bit multiplications and one addition within the loop, while the optimized version needs six 64-bit multiplications, five 32-bit multiplications, and a slew of other instructions (godbolt), so the optimized version is slower for small values of count.
I benchmarked on my PC, and count must be larger than 5 for the optimized version to be faster than the loop. Smaller CPUs with, for example, more expensive 64-bit multiplication will need a higher count for the optimization to help. And CPUs not having instructions for 64-bit multiplication (godbolt) will need a much higher count.

GCC does not do this optimization when the resulting expression is expensive, and the check in its source code has a comment explaining the reasoning

```c
  /* Do not emit expensive expressions.  The rationale is that
     when someone writes a code like

     while (n > 45) n -= 45;

     he probably knows that n is not large, and does not want it
     to be turned into n %= 45.  */
  || expression_expensive_p (def))
```

So GCC not doing this optimization is a feature – not a bug.
Getting GCC 1.27 to work on a modern system required a few changes: some to the as/ld options for running on a 64-bit OS, and I had much bigger problems with the ancient GCC not understanding the system headers. I also enabled the DBX debug format instead of the UNIX/32V SDB format that GDB does not understand. But I did not need to make any big changes to Mikhail’s patch.

The compiler supports options such as -O
, -E, -S, -c, -g, -W, -pedantic, and -fomit-frame-pointer, and they do what you expect from using a modern GCC. All the options are documented in the manual page – you can format the text by passing the gcc.1 file to man as

```
man -l gcc.1
```
GCC 1.27 is built as a 32-bit binary, so you need to have 32-bit development support installed

```
sudo apt install gcc-multilib
```

Download the source code and the patch, and apply the patch

```
wget https://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.27.tar.bz2
wget https://gist.github.com/kristerw/b854b6d285e678452a44a6bcbf7ef86f/raw/gcc-1.27.patch
tar xf gcc-1.27.tar.bz2
cd gcc-1.27
patch -p1 < ../gcc-1.27.patch
```
The compiler is configured for i386 by setting up four symbolic links

```
ln -s config-i386v.h config.h
ln -s tm-i386v.h tm.h
ln -s i386.md md
ln -s output-i386.c aux-output.c
```

You may want to change where the compiler is installed by updating bindir and libdir in the Makefile. I set them to

```
bindir = /home/kristerw/compilers/gcc-1.27/bin
libdir = /home/kristerw/compilers/gcc-1.27/lib
```
The compiler is then built with

```
make
```

We then build it again using the newly built compiler

```
make stage1
make CC=stage1/gcc CFLAGS="-O -Bstage1/ -Iinclude"
```

As a third, optional, step, we build it again using the second compiler and check that the resulting binaries are identical to those of the second build (if not, then the compiler has miscompiled itself)

```
make stage2
make CC=stage2/gcc CFLAGS="-O -Bstage2/ -Iinclude"
diff cpp stage2/cpp
diff gcc stage2/gcc
diff cc1 stage2/cc1
```

We are now ready to install the compiler

```
make install
```
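To check that the installed compiler works, we can compile and run a small test program (the path is the bindir I chose above; the program is written in old-style C so that it does not depend on how well the 1988-era compiler copes with modern system headers):

```c
/* hello.c */
main ()
{
  printf ("hello from gcc 1.27\n");
  return 0;
}
```

```
/home/kristerw/compilers/gcc-1.27/bin/gcc -O hello.c -o hello
./hello
```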
The program is assembled by running

```
acme -o test.prg test.asm
```

The resulting program test.prg can be run in the VICE emulator by loading it using the “Smart attach Disk/Tape” option.

Building libquantum with

```
./configure
make demos
```

produces two demo applications. Running the shor application as

```
shor 456 427
```

takes 11.0 seconds¹ on my PC (Ubuntu 16.04, Intel Broadwell CPU).
The shor application is relatively sensitive to changes, and we can see surprising performance differences when modifying the code. For example, adding this function

```c
void foo(int *a, int *b, int *c, int n)
{
    for (int i = 0; i < n; i++)
        *a++ = *b++ + *c++;
}
```

at the top of measure.c makes the program need 11.8 seconds to run – it has become 7% slower by just adding an unused function!
¹ The running time depends on a random seed, which makes the amount of work differ between runs of the demos in libquantum, so I hardcoded a seed in order to measure the same work in each run:

```diff
--- shor.c.orig	2007-09-03 08:47:55.000000000 +0200
+++ shor.c	2018-10-20 21:18:45.026076220 +0200
@@ -37,7 +37,7 @@
   int N;
   int c,q,a,b, factor;
 
-  srand(time(0));
+  srand(123456);
 
   if(argc == 1)
     {
```

The running time of this is stable (11.0 seconds with a standard deviation of 0.014).
Benchmark results often depend on the compiler's inlining decisions rather than on the code being measured. One example is the benchmark “string_view String Split Implementation” that measures the performance for different ways of splitting a string. It shows that StringSplitStd is about 9% faster than StringViewSplit (they have the scores 7829 and 7189), but the reason for this difference is that GCC has inlined everything for StringSplitStd, but not for StringViewSplit.

Removing StringViewSplitStd and StringViewSplitPtr from the benchmark makes the compiler make different inlining decisions, and we get the same performance for both StringSplitStd and StringViewSplit
(quick-bench).

Another example is this benchmark that measures a function converting a string to lower case

```c++
void by_index(char* s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        s[i] = tolower(s[i]);
}

static void LowerByIndex(benchmark::State& state)
{
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        char upper_string[] = "UPPER TEXT";
        by_index(upper_string, 10);

        // Make sure the variable is not optimized away by compiler
        benchmark::DoNotOptimize(upper_string);
    }
}

// Register the function as a benchmark
BENCHMARK(LowerByIndex);
```

The tolower function is defined in /usr/include/ctype.h as

```c++
extern const __int32_t **__ctype_tolower_loc (void) throw ()
     __attribute__ ((__const__));

extern __inline __attribute__ ((__gnu_inline__)) int
tolower (int __c) throw ()
{
  return __c >= -128 && __c < 256 ? (*__ctype_tolower_loc ())[__c] : __c;
}
```

This is inlined into
by_index, which in turn is inlined into LowerByIndex

```c++
static void LowerByIndex(benchmark::State& state)
{
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        char upper_string[] = "UPPER TEXT";
        for (size_t i = 0; i < 10; ++i) {
            int __c = upper_string[i];
            upper_string[i] = __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
        }
        benchmark::DoNotOptimize(upper_string);
    }
}
```
A char has values in the range -128 to 127 when compiling for the x86 architecture, so the compiler determines that the comparisons in

```c++
upper_string[i] = __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
```

are always true (as __c is assigned from a char), and the line is simplified to

```c++
upper_string[i] = (*__ctype_tolower_loc())[__c];
```
The __ctype_tolower_loc function is decorated with the const attribute, so the function call can be moved out of the loop, provided the loop body is always executed. Compilers typically represent for-loops as an if-statement followed by a do-while loop – for example

```c++
for (i = 0; i < n; ++i) {
    do_something();
}
```

is represented as

```c++
i = 0;
if (i < n) {
    do {
        do_something();
        ++i;
    } while (i < n);
}
```
This is done for both loops, and moving the call to __ctype_tolower_loc out of the loops gives us the result

```c++
static void LowerByIndex(benchmark::State& state)
{
    auto it = state.begin();
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[] = "UPPER TEXT";
            size_t i = 0;
            do {
                int __c = upper_string[i];
                upper_string[i] = p[__c];
                ++i;
            } while (i < 10);
            benchmark::DoNotOptimize(upper_string);
            ++it;
        } while (it != state.end());
    }
}
```
Note that the call to __ctype_tolower_loc is now outside of the code segment being measured! The compiler then fully unrolls the inner loop

```c++
static void LowerByIndex(benchmark::State& state)
{
    auto it = state.begin();
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[] = "UPPER TEXT";
            int __c0 = upper_string[0];
            upper_string[0] = p[__c0];
            int __c1 = upper_string[1];
            upper_string[1] = p[__c1];
            int __c2 = upper_string[2];
            upper_string[2] = p[__c2];
            int __c3 = upper_string[3];
            upper_string[3] = p[__c3];
            int __c4 = upper_string[4];
            upper_string[4] = p[__c4];
            int __c5 = upper_string[5];
            upper_string[5] = p[__c5];
            int __c6 = upper_string[6];
            upper_string[6] = p[__c6];
            int __c7 = upper_string[7];
            upper_string[7] = p[__c7];
            int __c8 = upper_string[8];
            upper_string[8] = p[__c8];
            int __c9 = upper_string[9];
            upper_string[9] = p[__c9];
            benchmark::DoNotOptimize(upper_string);
            ++it;
        } while (it != state.end());
    }
}
```
All accesses to upper_string are now to known positions, so the compiler can easily forward the values written to where they are read, and the generated code does not need to initialize or read from upper_string

```c++
static void LowerByIndex(benchmark::State& state)
{
    auto it = state.begin();
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[10];
            upper_string[0] = p['U'];
            upper_string[1] = p['P'];
            upper_string[2] = p['P'];
            upper_string[3] = p['E'];
            upper_string[4] = p['R'];
            upper_string[5] = p[' '];
            upper_string[6] = p['T'];
            upper_string[7] = p['E'];
            upper_string[8] = p['X'];
            upper_string[9] = p['T'];
            benchmark::DoNotOptimize(upper_string);
            ++it;
        } while (it != state.end());
    }
}
```
Several of the characters occur more than once in the string, so the compiler combines the duplicated loads from p

```c++
static void LowerByIndex(benchmark::State& state)
{
    auto it = state.begin();
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[10];
            upper_string[0] = p['U'];
            upper_string[1] = upper_string[2] = p['P'];
            upper_string[3] = upper_string[7] = p['E'];
            upper_string[4] = p['R'];
            upper_string[5] = p[' '];
            upper_string[6] = upper_string[9] = p['T'];
            upper_string[8] = p['X'];
            benchmark::DoNotOptimize(upper_string);
            ++it;
        } while (it != state.end());
    }
}
```
A common way to avoid these problems is to mark the benchmarked functions as noinline, but that only solves part of the problem – compilers do various interprocedural optimizations, and for GCC you should specify at least noclone too (see the sketch below). Other compilers may need to be restricted in different ways.
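A sketch of what this looks like for GCC, applied to the by_index function from the benchmark above:

```c
/* noinline stops GCC from inlining the function; noclone stops it from
   creating specialized copies of the function for constant arguments. */
#include <ctype.h>
#include <stddef.h>

__attribute__((noinline, noclone))
void by_index(char *s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        s[i] = tolower(s[i]);
}
```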
It is also possible to use volatile or functionality from the benchmarking framework (such as benchmark::DoNotOptimize and benchmark::ClobberMemory), but this may also introduce unintended behavior. For example, these workarounds make the code look “unusual” to the compiler, which may make various heuristics make different optimization decisions compared to normal usage.

> With microbenchmarks, like a trial lawyer during cross-examination, you should never ask a question you don’t know the answer to (or at least have a pretty good idea of what it is). Real-world systems are generally too complex and intertwined to understand from surface measurements alone. If you have no idea how a system works at all, you don’t know what the right questions are, nor how to ask them, and any answers you get will be opaque at best, if not outright garbage. Microbenchmarks are a useful tool to confirm that an existing model is a good approximation to reality, but not very helpful in building these models to begin with.
Note: the C++ standard library headers may make the code call the library function tolower instead of using the inlined version from ctype.h. You can see the optimization from this blog post using libstdc++ too by including ctype.h before any C++ header (but this is not possible in quick-bench, as it adds its own headers before the user code).