Software inefficiencies often show up as wasteful memory operations, and many research tools can detect various cases of this. One such tool is described in the paper “Watching for Software Inefficiencies with Witch”, which uses the Intel CPU performance monitoring units (PMU) and debug registers to find
- dead stores – writing a value that is never read,
- silent stores – writing the same value that was already present at the memory location,
- silent loads – reading an unchanged value from memory that was previously read.
A typical example is a dead store of the form

```c
*p = expensive_computation();
```

where it may make sense to reorganize the code to only do the expensive computation when the value is needed.
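To make the three cases concrete, here is a small hypothetical C fragment (the function and values are made up for illustration) containing one instance of each pattern:

```c
static int table[2];

int patterns(void)
{
    table[0] = 1;     /* dead store: overwritten before the value is ever read */
    table[0] = 2;

    table[1] = 5;
    table[1] = 5;     /* silent store: writes the value that is already there */

    int a = table[0]; /* first read of table[0] */
    int b = table[0]; /* silent load: the value is unchanged since the previous read */

    return a + b + table[1];
}
```

A compiler will usually remove redundancies this obvious when optimizing – the interesting cases in practice are the ones the compiler cannot prove safe to remove, such as stores to memory that may be aliased.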
I thought it would be fun to play with this, but the “Witch” tools need a custom Linux kernel which I found too complicated to set up… But I have always wanted to learn how to write a Valgrind tool, and writing a Witch-like tool seemed like a good beginner project (although the main contribution in the “Witch” paper is doing this without performance overhead – that is definitely not the case when using Valgrind…).
## The `deadstores` Valgrind tool

My Valgrind tool, “deadstores”, gathers data from each instruction that reads or writes memory:

- For stores, it records how many bytes were written, and how many of those bytes were read before the memory was overwritten (or made invalid by calling `free`, etc.). It also records how many times the instruction was executed, and how many of those executions wrote the same value as was already there.
- For loads, it records how many times the instruction was executed, and how many of those read a value that had previously been read.
Dead stores are tracked on byte granularity. For example, if we have a structure

```c
struct { int a, b, c, d; } foo;
```

and code that initializes the structure as

```c
memset(&foo, 0, sizeof(foo));
```

then the compiler will usually optimize the `memset` to one SIMD instruction that clears all 128 bits. If the program then only reads three of the values, we will see that 4 of the 16 bytes are dead – instruction granularity would not detect this, as the stored value is (partly) used.

Silent loads and stores are, however, tracked on instruction granularity. The reason is that byte granularity would report spurious silent loads/stores. For example, if `sum` is a 32-bit `int` variable containing the value `15`, and we add `10` to it

```c
sum += 10;
```

then this would be reported as 3 silent stores if we used byte granularity (as three of the bytes are `0` both before and after the operation).

## How to use the tool
The `deadstores` tool is used in the same way as other Valgrind tools

```
valgrind --tool=deadstores [valgrind-options] your-prog [your-prog-options]
```

The program to analyze must be built with debug symbols (`-g` or better), and optimizations should be enabled in order to produce relevant data. You also need `-fno-omit-frame-pointer` if you are passing `--use-stack-trace=yes` (see below) to the tool.

The result is written to a JSON file containing all data collected during the run. There is an example Python script `script/print_result.py` that extracts the top \(n\) instructions from each of “dead stores”, “silent stores”, and “silent loads”.

The name of the output file is by default `deadstores.out.<pid>` (where `<pid>` is the program’s process ID), but it can be changed with `--deadstores-out-file=<filename>`. The `%p`, `%q`, and `%n` format specifiers can be used to embed the process ID, the contents of an environment variable, and/or a sequence number in the name, in the same way as for the core option `--log-file`.
By default the tool tracks the instructions that read/write the memory, but this is often not too useful, as it tends to report things like

```
0x054b0a25: bytes_written: 11916560 bytes_read: 248126 bytes_dead: 11668434 nof_stores: 744785 nof_silent: 19326
  at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
```

The dead stores here come from the `memset` implementation in `libc`, and the values are the sums of all calls to `memset` in the program. This is not that helpful if we want to find dead stores we can eliminate, but the tool can instead track the loads/stores by the stack trace, which splits the reported data for an instruction according to where it was called from

```
0x054b0a25: bytes_written: 1707072 bytes_read: 1120 bytes_dead: 1705952 nof_stores: 106692 nof_silent: 2500
  at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
  by 0x00b317aa: poison_pages() (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-page.c:2125)
  by 0x00b3362c: ggc_collect() (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-page.c:2226)
  by 0x00bb82d4: cgraph_node::finalize_function(tree_node*, bool) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cgraphunit.c:494)
  by 0x00a60e43: expand_or_defer_fn(tree_node*) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cp/semantics.c:4486)

0x054b0a25: bytes_written: 521808 bytes_read: 12 bytes_dead: 521796 nof_stores: 32613 nof_silent: 0
  at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
  by 0x00d127fe: ggc_internal_cleared_alloc(unsigned long, void (*)(void*), unsigned long, unsigned long) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-common.c:118)
  by 0x012b796e: make_node(tree_code) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc.h:143)
  by 0x012b7c96: build_tree_list(tree_node*, tree_node*) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/tree.c:3184)
  by 0x009e32d6: cp_parser_parameter_declaration_list(cp_parser*, int) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cp/parser.c:22525)

...
```

This is enabled by `--use-stack-trace=yes`, which in general makes the result much more useful (and the tool much slower…). The program to analyze must be compiled with frame pointer support (`-fno-omit-frame-pointer`) in order to get a reliable result.

Most of the Valgrind core options are also supported, such as `--trace-children=yes`, which is useful when analyzing programs (such as `gcc`) that start subprocesses using the `exec` system call.

## Limitations
Valgrind is not an optimal framework for writing this kind of tool. It works by translating the assembly instructions to an internal, target-independent representation before the code is analyzed/instrumented by the tools, and this translation does not preserve all details of the instructions. This is not a problem for most instructions – for example, a simple addition instruction

```
add %rcx,%rax
```

is translated in a natural way to an IR instruction adding two values

```
t10 = Add64(t7,t4)
```

But some instructions are translated in a way that causes problems. For example, the bit test instruction

```
bt %rax,%r8
```

is generated as a sequence that stores 8 bytes at an unused address on the stack and reads back one byte of the result

```
t75 = GET:I64(48)    ; <-- Get stack pointer
t74 = Sub64(t75,0x120:I64)
PUT(48) = t74
STle(t74) = t73      ; <-- Store 8 bytes
t18 = And64(t36,0x3F:I64)
t78 = Sar64(t18,0x3:I8)
t77 = Add64(t74,t78)
t80 = And64(t18,0x7:I64)
t112 = 64to8(t80)
t79 = t112
t15 = LDle:I8(t77)   ; <-- Load 1 byte
t84 = GET:I64(168)
t113 = amd64g_calculate_rflags_all[mcx=0x9]{0x581732b0}(0x13:I64,t61,0x0:I64,t84):I64
t85 = t113
t114 = 8Uto64(t15)
t89 = t114
t88 = Shr64(t89,t79)
t87 = And64(t88,0x1:I64)
t90 = And64(t85,0x8D4:I64)
t86 = Or64(t90,t87)
PUT(144) = 0x0:I64
PUT(160) = 0x0:I64
PUT(152) = t86
PUT(168) = 0x0:I64
t91 = Add64(t74,0x120:I64)
PUT(48) = t91
```

This makes the `deadstores` tool treat this as an 8-byte store where 7 of those bytes are dead… I have worked around this particular case (by ignoring stores to offset `0` from the stack pointer – this holds the pointer to the stack frame for normal functions, so these memory accesses are not relevant for the tool anyway), but there may be additional cases I have not found yet…

One other limitation of the `deadstores` tool is that the tracking of silent loads only handles memory addresses that have been stored to, so it does not detect silent loads in read-only data segments. This could be fixed by tracking valid memory in the same way as done by the `memcheck` tool.

## Implementation details

### Structures
The tool keeps track of statistics for each instruction reading/writing memory (number of bytes written, number of times executed, etc.) and the status of each byte written (to track if the stored value is later read or re-written with the same value).

The statistics for each instruction are kept in the `node` structure

```c
struct node {
  handle parent;
  handle left;
  handle right;
  enum color color;
  struct key key;

  // Store
  Long bytes_written;
  Long bytes_read;
  Long nof_stores;
  Long nof_silent_stores;

  // Load
  Long nof_loads;
  Long nof_silent_loads;
};
```

The `node` structures are stored in an array called `nodes`, and the nodes are organized as a red-black tree keyed by the address of the instruction (or rather, a “generalized address” – it contains the top 5 addresses of the stack trace when `--use-stack-trace=yes` is used). The tree structure is managed by the `parent`, `left`, and `right` fields – these are conceptually pointers to `node` structures, but we use handles (which are just indices into the `nodes` array) so that they stay valid even if we need to `realloc` the `nodes` array to increase its size.

Each byte accessed by the program gets an `addr_info` structure for that address (these are allocated in blocks of `1<<20` addresses, and the blocks are stored in a hash table `addr_info_table`).
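In code, the lookup could look something like the following simplified sketch (hypothetical – the real `addr_info_table` is a hash table, which is replaced here by a directly indexed array that ignores collisions):

```c
#include <stdlib.h>

#define BLOCK_BITS  20
#define BLOCK_SIZE  ((size_t)1 << BLOCK_BITS)  /* 1<<20 addresses per block */
#define NOF_BLOCKS  1024                       /* simplification of the hash table */

struct addr_info {
    unsigned store_instr_h : 31;
    unsigned is_written : 1;
};

static struct addr_info *blocks[NOF_BLOCKS];   /* stand-in for addr_info_table */

/* Map an address to its addr_info entry, allocating the containing
 * block on demand. */
struct addr_info *addr2addr_info(size_t addr)
{
    size_t block = (addr >> BLOCK_BITS) % NOF_BLOCKS;
    if (blocks[block] == NULL)
        blocks[block] = calloc(BLOCK_SIZE, sizeof(struct addr_info));
    return &blocks[block][addr % BLOCK_SIZE];
}
```

Allocating the entries in blocks like this means memory is only spent on address ranges the program actually touches.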
```c
struct addr_info {
  unsigned store_instr_h : 31;
  unsigned is_written : 1;
};
```

The `addr_info` structure contains 1 bit telling if the byte is valid (that is, if it has been written to and not been invalidated by the memory being `free`d, etc.), and a handle to the instruction that wrote to this byte. Programs do not use that many load/store instructions, so the handle is stored in 31 bits to make the structure fit in 32 bits. This makes a big difference when tracing memory-hungry programs...

### Instrumentation
Valgrind works by transforming the binary code to its own internal representation (IR), and the tools can then add instrumentation to the IR before Valgrind JIT-compiles and executes the code. The `deadstores` tool adds its instrumentation in the function `ds_instrument`.

The code in `ds_instrument` iterates through the IR to find the instructions that access memory – the most important are `Ist_WrTmp` (load) and `Ist_Store` (store), but there are a few others (such as the atomic compare-and-swap `Ist_CAS`). Each such instruction is instrumented using the `Ist_Dirty` IR instruction, which is used to call a helper function. The `deadstores` tool has one such helper function for tracking each of “dead stores”, “silent stores”, and “loads” (each instruction that writes memory gets both the “dead stores” and “silent stores” helper functions, and the “silent stores” helper function must be called first).

The helper functions used by the `deadstores` tool take three parameters – the address and size of the memory access, and the address of the instruction. The function that tracks dead stores updates the counters for the instruction (that the instruction has been executed and has written `size` bytes) and the status of each written byte (that it has been written, and which instruction wrote the value).

```c
static void dead_store_tracker(Addr addr, SizeT size, Addr iaddr)
{
  handle h = iaddr2handle(iaddr);
  nodes[h].nof_stores += 1;
  nodes[h].bytes_written += size;

  for (SizeT i = 0; i < size; i++) {
    struct addr_info *ai = addr2addr_info(addr + i);
    ai->is_written = True;
    ai->store_instr_h = h;
  }
}
```

The load tracking checks each byte’s `store_instr_h` to see which instruction wrote the byte. If this is different from `0`, then the instruction corresponding to the handle has its `bytes_read` counter incremented, and the byte’s `store_instr_h` is set to `0` to mark it as read. If all of the bytes were written but have a zero `store_instr_h`, then the load is silent, and the load instruction’s counter is updated accordingly.

```c
static void load_tracker(Addr addr, SizeT size, Addr iaddr)
{
  SizeT silent_load = 0;
  for (SizeT i = 0; i < size; i++) {
    struct addr_info *ai = addr2addr_info(addr + i);
    if (ai->store_instr_h != 0) {
      nodes[ai->store_instr_h].bytes_read++;
      ai->store_instr_h = 0;
    } else if (ai->is_written) {
      silent_load += 1;
    }
  }

  handle h = iaddr2handle(iaddr);
  nodes[h].nof_loads += 1;
  if (silent_load == size) {
    nodes[h].nof_silent_loads += 1;
  }
}
```

The tracking of silent stores needs a bit more work, as it must compare the original value in memory with the newly written one. This cannot be done in the helper function, so it is done by inserting new IR that does the comparison, and the helper function is not called unless the values are identical. This functionality is partially supported by Valgrind – the `Ist_Dirty` instruction has an optional `guard` parameter taking an IR expression that determines if the function is to be called, so the only thing needed is to generate the comparison and pass it as the guard. This comparison is straightforward to emit – if `st` is an `IRStmt` pointer to a 32-bit `Ist_Store`, then the IR for the comparison can be emitted as

```c
IRExpr *data = st->Ist.Store.data;
IRType type = typeOfIRExpr(tyenv, data);
IRExpr *addr = st->Ist.Store.addr;
tl_assert(type == Ity_I32);
IRExpr *load = emit_load(sb, type, addr);
IRExpr *guard = emit_binop(sb, Iop_CmpEQ32, data, load);
```

where the `emit_` functions are simple wrappers around the Valgrind API to get rid of some boilerplate code (`sb` is the IR superblock containing the instructions – this is one of the inputs to `ds_instrument`).

The only thing left to do for the silent store tracker is to check that all addresses have been written to (in order to prevent reporting spurious silent stores for, for example, the first write to `malloc`ed memory that by luck happens to contain the same value that is stored by the instruction), and to update the instruction’s `nof_silent_stores` counter if they have.

```c
static void silent_store_tracker(Addr addr, SizeT size, Addr iaddr)
{
  Bool is_written = True;
  for (SizeT i = 0; i < size; i++) {
    struct addr_info *ai = addr2addr_info(addr + i);
    is_written = is_written && ai->is_written;
  }
  if (is_written) {
    handle h = iaddr2handle(iaddr);
    nodes[h].nof_silent_stores += 1;
  }
}
```
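To see how these helpers cooperate, here is a small standalone simulation (not the tool itself – Valgrind’s `Addr` and `SizeT` types and the real `iaddr2handle`/`addr2addr_info` are replaced by trivial stand-ins, and the value comparison for silent stores is assumed to have succeeded, as in the real tool where the IR guard decides whether `silent_store_tracker` runs at all):

```c
#include <stddef.h>

typedef size_t Addr;
typedef size_t SizeT;
typedef int handle;

/* Simplified stand-ins for the tool's data structures. */
struct node { long bytes_written, bytes_read, nof_stores, nof_silent_stores,
              nof_loads, nof_silent_loads; };
struct addr_info { unsigned store_instr_h : 31; unsigned is_written : 1; };

static struct node nodes[16];
static struct addr_info infos[256];

/* Stand-ins: use the low bits of the addresses directly as keys. */
static handle iaddr2handle(Addr iaddr) { return (handle)(iaddr % 16); }
static struct addr_info *addr2addr_info(Addr a) { return &infos[a % 256]; }

static void silent_store_tracker(Addr addr, SizeT size, Addr iaddr)
{
    int is_written = 1;
    for (SizeT i = 0; i < size; i++)
        is_written = is_written && addr2addr_info(addr + i)->is_written;
    if (is_written)
        nodes[iaddr2handle(iaddr)].nof_silent_stores += 1;
}

static void dead_store_tracker(Addr addr, SizeT size, Addr iaddr)
{
    handle h = iaddr2handle(iaddr);
    nodes[h].nof_stores += 1;
    nodes[h].bytes_written += size;
    for (SizeT i = 0; i < size; i++) {
        struct addr_info *ai = addr2addr_info(addr + i);
        ai->is_written = 1;
        ai->store_instr_h = h;
    }
}

static void load_tracker(Addr addr, SizeT size, Addr iaddr)
{
    SizeT silent_load = 0;
    for (SizeT i = 0; i < size; i++) {
        struct addr_info *ai = addr2addr_info(addr + i);
        if (ai->store_instr_h != 0) {
            nodes[ai->store_instr_h].bytes_read++;
            ai->store_instr_h = 0;
        } else if (ai->is_written) {
            silent_load += 1;
        }
    }
    handle h = iaddr2handle(iaddr);
    nodes[h].nof_loads += 1;
    if (silent_load == size)
        nodes[h].nof_silent_loads += 1;
}
```

Replaying a 4-byte store (instruction handle 1) followed by two loads of the same bytes (handle 2) gives `nodes[1].bytes_read == 4` and counts the second load, but not the first, as silent – matching the description above.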
### Callbacks

It is possible to register callbacks that Valgrind calls for various things that cannot be done in the IR instrumentation (or that most tools must do, so it is better to have a common mechanism for this). Some examples of this are `track_new_mem_stack` and `track_die_mem_stack`, which are called each time the stack pointer is decreased/increased, or the `needs_malloc_replacement` callbacks that are called when the program calls `malloc`, `free`, etc.

The `deadstores` tool uses these to mark memory as invalid (that is, to clear the `is_written` flag) in order to prevent counters from being incorrectly updated by the silent load/store tracking when, for example, previously freed memory is returned from `malloc`.

## TODO
There are many things that could be improved:

- The tool is slow. I have not done any profiling, but I would guess most of the time is spent in `iaddr2handle` and `addr2addr_info`. It is easy to improve how this is done.
- Add a test suite.
- Better tracking of valid memory (as described in the “Limitations” section above).
- Better tracking of variables and structures (to automatically find write-only variables and structure elements).
- Finding other types of inefficiencies, such as `memcpy` where the program could have used the original data.
- …
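For the first item, one plausible cheap win (a hypothetical sketch, not code from the tool) is a one-entry cache in front of the `addr2addr_info` block lookup, exploiting locality – consecutive accesses usually fall in the same `1<<20`-address block:

```c
#include <stdlib.h>

#define BLOCK_BITS  20
#define NOF_BLOCKS  1024

struct addr_info { unsigned store_instr_h : 31; unsigned is_written : 1; };

static unsigned slow_lookups;                 /* counts cache misses, for testing */
static struct addr_info *table[NOF_BLOCKS];   /* stand-in for the real hash table */

static struct addr_info *lookup_block_slow(size_t block_nr)
{
    slow_lookups++;
    size_t i = block_nr % NOF_BLOCKS;         /* collisions ignored in this sketch */
    if (table[i] == NULL)
        table[i] = calloc((size_t)1 << BLOCK_BITS, sizeof(struct addr_info));
    return table[i];
}

static size_t cached_block_nr = (size_t)-1;
static struct addr_info *cached_block;

struct addr_info *addr2addr_info(size_t addr)
{
    size_t block_nr = addr >> BLOCK_BITS;
    if (block_nr != cached_block_nr) {        /* miss: fall back to the table */
        cached_block_nr = block_nr;
        cached_block = lookup_block_slow(block_nr);
    }
    return &cached_block[addr & (((size_t)1 << BLOCK_BITS) - 1)];
}
```

A similar cache could be put in front of `iaddr2handle`, since a loop tends to execute the same load/store instructions over and over.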