Sunday, February 2, 2020

Watching for software inefficiencies with Valgrind

Software inefficiencies often show up as wasteful memory operations, and many research tools can detect various cases of this. One such tool is described in the paper “Watching for Software Inefficiencies with Witch”, which uses the Intel CPU performance monitoring unit (PMU) and debug registers to find
  • dead stores – writing a value that is never read,
  • silent stores – writing the same value that was already present at the memory location,
  • silent loads – reading a value that is identical to what was previously read from the memory location.
Most of the dead/silent memory operations are not problematic, and the analyzed program would most likely become slower if we tried to get rid of them by making the code more complicated. But there are cases, such as mostly dead stores of the form
*p = expensive_computation();
where it may make sense to reorganize the code to only do the expensive computation when the value is needed.
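To make this concrete, here is a minimal sketch of such a restructuring (expensive_computation, update_eager, and update_lazy are hypothetical stand-ins, not code from the paper):

```c
#include <stdbool.h>

static int computations_done = 0;

static int expensive_computation(void)
{
  computations_done++;
  return 42;
}

/* Before: *p is unconditionally overwritten, so on paths where the
   value is never read, the computation is wasted work -- the store is
   "mostly dead". */
void update_eager(int *p)
{
  *p = expensive_computation();
}

/* After: only do the expensive computation when the caller signals
   that the value will actually be used. */
void update_lazy(int *p, bool value_needed)
{
  if (value_needed)
    *p = expensive_computation();
}
```

The trade-off is the one mentioned above: the lazy version is more complicated, so it is only worth doing when the computation is expensive and the value is rarely needed.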

I thought it would be fun to play with this, but the “Witch” tools need a custom Linux kernel which I found too complicated to set up… But I have always wanted to learn how to write a Valgrind tool, and writing a Witch-like tool seemed like a good beginner project (although the main contribution in the “Witch” paper is doing this without performance overhead – that is definitely not the case when using Valgrind…).

The deadstores Valgrind tool

My Valgrind tool, “deadstores”, gathers data from each instruction that reads or writes memory:
  • For stores, it records how many bytes were written, and how many of those bytes were read before the memory was overwritten (or made invalid by calling free, etc.). It also records how many times the instruction was executed, and how many of those wrote the same value as was there before.
  • For loads, it records how many times the instruction was executed, and how many of those read a value that had previously been read.
The reason dead stores are tracked at byte granularity is that I want to be able to find obsolete, “write-only” structure elements that are never read. For example, if we have a structure such as
struct {
  int a, b, c, d;
} foo;
and code that is initializing the structure as
memset(&foo, 0, sizeof(foo));
then the compiler will usually optimize the memset to one SIMD instruction that clears all 128 bits. If the program only ever reads three of the values, then we will see that 4 of the 16 bytes are dead – instruction granularity would not detect this, as the stored value is (partly) used.
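A toy model (not the tool's implementation) of this byte-granularity bookkeeping: the memset marks every byte of the structure as a dead store, and reading a field revives its bytes. Reading only a, b, and c leaves the four bytes of d dead:

```c
#include <stddef.h>
#include <string.h>

struct foo {
  int a, b, c, d;
};

static int dead_bytes_after_reading_abc(void)
{
  unsigned char dead[sizeof(struct foo)];
  memset(dead, 1, sizeof(dead));  /* the memset stores all 16 bytes */

  /* Reads of foo.a, foo.b, and foo.c mark those bytes as used. */
  memset(dead + offsetof(struct foo, a), 0, sizeof(int));
  memset(dead + offsetof(struct foo, b), 0, sizeof(int));
  memset(dead + offsetof(struct foo, c), 0, sizeof(int));

  int nof_dead = 0;
  for (size_t i = 0; i < sizeof(dead); i++)
    nof_dead += dead[i];
  return nof_dead;  /* the 4 bytes of foo.d were never read */
}
```

(This assumes the usual 4-byte int, so the structure is 16 bytes.)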

Silent loads and stores are, however, tracked at instruction granularity. The reason is that byte granularity would report spurious silent loads/stores. For example, if sum is a 32-bit int variable containing the value 15, and we add 10 to it
sum += 10;
then this would be reported as 3 silent stores if we used byte granularity (as three of the bytes are 0 both before and after the operation).
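The arithmetic behind that claim can be checked with a small (illustrative) helper that compares the byte representations of the value before and after the store:

```c
#include <string.h>

/* Counts how many bytes of a 32-bit store leave memory unchanged.
   For sum = 15 and sum += 10, the stored value is 25; only the low
   byte changes (0x0F -> 0x19), so the three zero bytes are unchanged
   regardless of endianness. */
static int unchanged_bytes(int before, int after)
{
  unsigned char old_bytes[sizeof(int)], new_bytes[sizeof(int)];
  memcpy(old_bytes, &before, sizeof(int));
  memcpy(new_bytes, &after, sizeof(int));

  int n = 0;
  for (size_t i = 0; i < sizeof(int); i++)
    if (old_bytes[i] == new_bytes[i])
      n++;
  return n;
}
```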

How to use the tool

The deadstores tool is used in the same way as other Valgrind tools:
valgrind --tool=deadstores [valgrind-options] your-prog [your-prog-options]
The program to analyze must be built with debug symbols (-g or better), and optimizations should be enabled in order to produce relevant data. You also need -fno-omit-frame-pointer if you are passing --use-stack-trace=yes (see below) to the tool.

The result is written to a JSON file containing all data collected during the run. There is an example Python script script/print_result.py that extracts the top n instructions from each of “dead stores”, “silent stores”, and “silent loads”.

The name of the output file is by default deadstores.out.<pid> (where <pid> is the program’s process ID), but it can be changed with --deadstores-out-file=<filename>. The %p, %q, and %n format specifiers can be used to embed the process ID, the contents of an environment variable, and/or a sequence number in the name, in the same way as for the core option --log-file.

By default, the tool tracks the individual instructions that read/write memory, but this is often not very useful, as it tends to report things like
0x054b0a25:
  bytes_written: 11916560
  bytes_read:      248126
  bytes_dead:    11668434
  nof_stores:      744785
  nof_silent:       19326
    at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
The dead stores here come from the memset implementation in libc, and the values are the sums over all calls to memset in the program. This is not very helpful if we want to find dead stores we can eliminate, but the tool can instead track the loads/stores by stack trace, which splits the reported data for an instruction according to where it was called from:
0x054b0a25:
  bytes_written:  1707072
  bytes_read:        1120
  bytes_dead:     1705952
  nof_stores:      106692
  nof_silent:        2500
    at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
    by 0x00b317aa: poison_pages() (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-page.c:2125)
    by 0x00b3362c: ggc_collect() (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-page.c:2226)
    by 0x00bb82d4: cgraph_node::finalize_function(tree_node*, bool) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cgraphunit.c:494)
    by 0x00a60e43: expand_or_defer_fn(tree_node*) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cp/semantics.c:4486)

0x054b0a25:
  bytes_written:   521808
  bytes_read:          12
  bytes_dead:      521796
  nof_stores:       32613
  nof_silent:           0
    at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
    by 0x00d127fe: ggc_internal_cleared_alloc(unsigned long, void (*)(void*), unsigned long, unsigned long) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-common.c:118)
    by 0x012b796e: make_node(tree_code) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc.h:143)
    by 0x012b7c96: build_tree_list(tree_node*, tree_node*) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/tree.c:3184)
    by 0x009e32d6: cp_parser_parameter_declaration_list(cp_parser*, int) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cp/parser.c:22525)

...
This is enabled by --use-stack-trace=yes, which in general makes the result much more useful (and the tool much slower…). The program to analyze must be compiled with frame pointer support (-fno-omit-frame-pointer) in order to get a reliable result.

Most of the Valgrind core options are also supported, such as --trace-children=yes which is useful when analyzing programs (such as gcc) that start subprocesses using the exec system call.

Limitations

Valgrind is not an optimal framework for writing this kind of tool. The reason is that it works by translating the assembly instructions to an internal, target-independent, representation before they are analyzed/instrumented by the tools, and this translation does not preserve all details of the instructions. This is not a problem for most instructions – for example, a simple addition instruction
add %rcx,%rax
is translated in a natural way to an IR instruction adding two values
t10 = Add64(t7,t4)
But some instructions are translated in a way that causes problems. For example, the bit test instruction
bt %rax,%r8
is translated to a sequence that stores 8 bytes at an unused address on the stack and reads back one byte of the result
t75 = GET:I64(48)          ; <-- Get stack pointer
t74 = Sub64(t75,0x120:I64)
PUT(48) = t74
STle(t74) = t73            ; <-- Store 8 bytes
t18 = And64(t36,0x3F:I64)
t78 = Sar64(t18,0x3:I8)
t77 = Add64(t74,t78)
t80 = And64(t18,0x7:I64)
t112 = 64to8(t80)
t79 = t112
t15 = LDle:I8(t77)         ; <-- Load 1 byte
t84 = GET:I64(168)
t113 = amd64g_calculate_rflags_all[mcx=0x9]{0x581732b0}(0x13:I64,t61,0x0:I64,t84):I64
t85 = t113
t114 = 8Uto64(t15)
t89 = t114
t88 = Shr64(t89,t79)
t87 = And64(t88,0x1:I64)
t90 = And64(t85,0x8D4:I64)
t86 = Or64(t90,t87)
PUT(144) = 0x0:I64
PUT(160) = 0x0:I64
PUT(152) = t86
PUT(168) = 0x0:I64
t91 = Add64(t74,0x120:I64)
PUT(48) = t91
This makes the deadstores tool treat this as an 8-byte store where 7 of those bytes are dead… I have worked around this particular case (by ignoring stores to offset 0 from the stack pointer – this holds the pointer to the stack frame for normal functions, so these memory accesses are not relevant for the tool anyway), but there may be additional cases I have not found yet…

One other limitation of the deadstores tool is that the tracking of silent loads only handles memory addresses that have been stored to, so it does not detect silent loads in read-only data segments. This could be fixed by tracking valid memory in the same way as done by the memcheck tool.

Implementation details

Structures

The tool keeps track of statistics for each instruction reading/writing memory (number of bytes written, number of times executed, etc.) and the status of each byte written (to track if the stored value is later read or re-written with the same value).

The statistics for each instruction are kept in the node structure
struct node {
  handle parent;
  handle left;
  handle right;
  enum color color;
  struct key key;

  // Store
  Long bytes_written;
  Long bytes_read;
  Long nof_stores;
  Long nof_silent_stores;

  // Load
  Long nof_loads;
  Long nof_silent_loads;
};
The node structures are stored in an array called nodes, and the nodes are organized as a red-black tree keyed by the address of the instruction (or rather, a “generalized address” – it contains the top 5 addresses of the stack trace when --use-stack-trace=yes is used). The tree structure is managed by the parent, left, and right fields – these are conceptually pointers to node structures, but we use handles (which are just indices into the nodes array) so that they stay valid even if we need to realloc the nodes array to increase its size.
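The handle scheme can be sketched as follows (an illustrative toy, not the tool's actual code): all links are indices into one growable array, so they survive the array being moved by realloc, which raw pointers would not.

```c
#include <stdlib.h>

typedef unsigned handle;

struct toy_node {
  handle parent, left, right;
  long bytes_written;
  /* ...remaining statistics fields as in the struct above... */
};

static struct toy_node *toy_nodes;
static size_t nof_toy_nodes, toy_capacity;

static handle toy_new_node(void)
{
  if (nof_toy_nodes == toy_capacity) {
    toy_capacity = toy_capacity ? 2 * toy_capacity : 16;
    /* realloc may move the array, but existing handles (indices)
       are unaffected. */
    toy_nodes = realloc(toy_nodes, toy_capacity * sizeof(struct toy_node));
  }
  toy_nodes[nof_toy_nodes] = (struct toy_node){0};
  return (handle)nof_toy_nodes++;
}
```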

Each byte accessed by the program gets an addr_info structure (these are allocated in blocks of 1<<20 addresses, and the blocks are stored in a hash table, addr_info_table).
struct addr_info {
  unsigned store_instr_h : 31;
  unsigned is_written : 1;
};
The addr_info structure contains 1 bit telling if the byte is valid (that is, if it has been written to and not been invalidated by the memory being freed, etc.), and a handle to the instruction that wrote the byte. Programs do not use that many load/store instructions, so the handle is stored in 31 bits to make the structure fit in 32 bits. This makes a big difference when tracing memory-hungry programs...
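The address-to-entry mapping can be sketched like this (the helper names are mine, and the high-bits/low-bits split is an assumption based on the 1<<20 block size; the bitfield layout is from the structure above):

```c
/* The 31+1 bitfield split makes the structure fit in 32 bits. */
struct addr_info {
  unsigned store_instr_h : 31;
  unsigned is_written : 1;
};

#define BLOCK_BITS 20

/* Hypothetical lookup helpers: the high bits of the address select a
   block of 1<<20 entries in the hash table, and the low 20 bits index
   the byte's entry within that block. */
static unsigned long block_key(unsigned long addr)
{
  return addr >> BLOCK_BITS;
}

static unsigned long block_offset(unsigned long addr)
{
  return addr & ((1ul << BLOCK_BITS) - 1);
}
```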

Instrumentation

Valgrind works by transforming the binary code to its own internal representation (IR), and the tools can then add instrumentation to the IR before Valgrind JIT-compiles and executes the code. The deadstores tool adds its instrumentation in the function ds_instrument.

The code in ds_instrument iterates through the IR to find the instructions that access memory – the most important are Ist_WrTmp (load) and Ist_Store (store), but there are a few others (such as the atomic compare-and-swap Ist_CAS). Each such instruction is instrumented using the Ist_Dirty IR instruction, which calls a helper function. The deadstores tool has one such helper function for tracking each of “dead stores”, “silent stores”, and “loads” (each instruction that writes memory gets both the “dead stores” and the “silent stores” helper functions, and the “silent stores” helper function must be called first).

The helper functions used by the deadstores tool take three parameters – the address and size of the memory access, and the address of the instruction. The function tracking dead stores updates the counters for the instruction (that it has been executed and has written size bytes) and the status of each written byte (that it has been written, and which instruction wrote the value).
static void dead_store_tracker(Addr addr, SizeT size, Addr iaddr)
{
  handle h = iaddr2handle(iaddr);
  nodes[h].nof_stores += 1;
  nodes[h].bytes_written += size;

  for (SizeT i = 0; i < size; i++) {
    struct addr_info *ai = addr2addr_info(addr + i);
    ai->is_written = True;
    ai->store_instr_h = h;
  }
}
The load tracking checks each byte’s store_instr_h to see which instruction wrote the byte. If it is different from 0, then the instruction corresponding to the handle has its bytes_read counter incremented, and the byte’s store_instr_h is set to 0 to mark it as read. If all of the bytes were written but have a zero store_instr_h, then the load is silent, and the load instruction’s counter is updated accordingly.
static void load_tracker(Addr addr, SizeT size, Addr iaddr)
{
  SizeT silent_load = 0;
  for (SizeT i = 0; i < size; i++) {
    struct addr_info *ai = addr2addr_info(addr + i);
    if (ai->store_instr_h != 0) {
      nodes[ai->store_instr_h].bytes_read++;
      ai->store_instr_h = 0;
    } else if (ai->is_written) {
      silent_load += 1;
    }
  }

  handle h = iaddr2handle(iaddr);
  nodes[h].nof_loads += 1;
  if (silent_load == size) {
    nodes[h].nof_silent_loads += 1;
  }
}
The tracking of silent stores needs a bit more work, as it must compare the value in the original memory with the newly written value. This cannot be done in the helper function, so it is done by inserting new IR that does the comparison, and the helper function is not called unless the values are identical. This functionality is partially supported by Valgrind – the Ist_Dirty instruction has an optional guard parameter taking an IR expression that determines if the function is to be called, so the only thing needed is to generate the comparison and pass it as the guard. And this comparison is straightforward to emit – if st is an IRStmt pointer to a 32-bit Ist_Store, then the IR for the comparison can be emitted as
IRExpr *data = st->Ist.Store.data;
IRType type = typeOfIRExpr(tyenv, data);
IRExpr *addr = st->Ist.Store.addr;
tl_assert(type == Ity_I32);
IRExpr *load = emit_load(sb, type, addr);
IRExpr *guard = emit_binop(sb, Iop_CmpEQ32, data, load);
where the emit_ functions are simple wrappers around the Valgrind API to get rid of some boilerplate code (sb is the IR superblock containing the instructions – this is one of the inputs to ds_instrument).

The only thing left to do for the silent store tracker is to check that all addresses have been written to (in order to prevent reporting spurious silent stores for, for example, the first write to malloced memory that by luck happens to contain the same value that is stored by the instruction), and to update the instruction’s nof_silent_stores counter if they have.
static void silent_store_tracker(Addr addr, SizeT size, Addr iaddr)
{
  Bool is_written = True;
  for (SizeT i = 0; i < size; i++) {
    struct addr_info *ai = addr2addr_info(addr + i);
    is_written = is_written && ai->is_written;
  }
  if (is_written) {
    handle h = iaddr2handle(iaddr);
    nodes[h].nof_silent_stores += 1;
  }
}

Callbacks

It is possible to register callbacks that Valgrind calls for various events that cannot be observed in the IR instrumentation (or that most tools need, so it is better to have a common mechanism for them). Some examples are track_new_mem_stack and track_die_mem_stack, which are called each time the stack pointer is decreased/increased, and the needs_malloc_replacement callbacks, which are called when the program calls malloc, free, etc.

The deadstores tool uses these to mark memory as invalid (that is, to clear the is_written flag), which prevents the counters from being incorrectly updated by the silent load/store tracking when, for example, previously freed memory is returned from malloc.

TODO

There are many things that could be improved:
  • The tool is slow. I have not done any profiling, but I would guess most of the time is spent in iaddr2handle and addr2addr_info. It is easy to improve how this is done.
  • Add a test suite.
  • Better tracking of valid memory (as described in the “limitations” section above).
  • Better tracking of variables and structures (to automatically find write-only variables and structure elements).
  • Finding other types of inefficiencies, such as memcpy where the program could have used the original data.
But I do not expect to do this anytime soon (if ever…) as my main goal was just to learn how to write a Valgrind tool…

1 comment:

  1. Great article! Out of curiosity, did you run it on any existing popular programs, and if so, did you find any offenders? Not my area of expertise but I wonder if this could be particularly useful for finding optimizations in interpreted e.g. python code
