## Sunday, October 21, 2018

### Optimization and performance measurement

The performance of programs depend surprisingly much on things that “should not” matter (such as the order files are passed to the linker), and modifying one part of the program may change the performance of other parts.

### An example

We will look at the libquantum library as an example. Configuring and compiling this as
./configure
make demos

produces two demo applications. Running the shor application as
shor 456 427

takes 11.0 seconds1 on my PC (Ubuntu 16.04, Intel Broadwell CPU).

The performance of the shor application is relatively sensitive to changes, and we can see surprising performance differences when modifying the code. For example, adding this function
void foo(int *a, int *b, int *c, int n)
{
for (int i = 0; i < n; i++)
*a++ = *b++ + *c++;
}

at the top of measure.c makes the program need 11.8 seconds to run – it has become 7% slower by just adding an unused function!

### Why we get the performance difference

The only effect of adding the unused function is that some other functions are moved to different addresses2, so it is natural to attribute the performance difference to “cache effects“. But there are several other hardware bottlenecks that can cause this kind of performance variability – this particular case is due to branch prediction.

Branch predictors keep track of the histories of branches by building various tables in hardware (see Dan Luu’s great blog post). But the branch prediction hardware looks only at a few bits of the addresses, so several branches may map to the same slot, with the result that the branch predictor cannot differentiate between them. Changing the size of one code segment makes other parts of the program move to different addresses, and some branches may be unlucky in that they now alias with branches behaving in a different way, which cause the performance to regress due to the excessive misprediction.

### Impact of this when optimizing

All books and conference talks about optimization tell you to always measure your change to verify that it actually improves the performance before committing. This is a good advice, but just measuring the running time is not enough! For example, if our change to libquantum was a real improvement that made the code 5% faster, then we would still have seen a 2% regression due to the (unrelated) branch prediction issues. But this regression is just us being unlucky with code placement, and further changes may make us lucky again, so it would probably have been better long-term to do the change. Similarly, I have seen too many bad changes done because “we measured it, and this made the program faster (but we do not understand why...)“.

So it is often useful to dig deeper and understand why we get the performance result so that we can attack the root cause and not rely on being lucky – for the libquantum example we could ensure the branches in the sensitive part of the code are located near each other in memory (for example, by forcing the code to be inlined) or try to eliminate some branches by algorithmic changes.

1. There are some randomized algorithms in libquantum, so I hardcoded a seed in order to measure the same work in each run:
--- shor.c.orig 2007-09-03 08:47:55.000000000 +0200
+++ shor.c 2018-10-20 21:18:45.026076220 +0200
@@ -37,7 +37,7 @@
int N;
int c,q,a,b, factor;

-  srand(time(0));
+  srand(123456);

if(argc == 1)
{

The running time of this is stable (11.0 seconds with a standard deviation of 0.014).
2. It could also have made the compiler make different inlining decisions etc., but I have verified that the only difference for this program is that some code is placed on different addresses compared to the original version.

1. I'm using the Linux $$\verb!perf!$$ tool – it reads the CPU’s hardware counters and can annotate the assembly with information about the cost of the different instructions.