The performance of a program depends surprisingly much on things that “should not” matter (such as the order in which files are passed to the linker), and modifying one part of the program may change the performance of other parts.
An example
We will look at the libquantum library as an example. Configuring and compiling it as

./configure
make demos

produces two demo applications. Running the shor application as

shor 456 427

takes 11.0 seconds¹ on my PC (Ubuntu 16.04, Intel Broadwell CPU).
The performance of the shor application is relatively sensitive to changes, and we can see surprising performance differences when modifying the code. For example, adding this function

void foo(int *a, int *b, int *c, int n)
{
  for (int i = 0; i < n; i++)
    *a++ = *b++ + *c++;
}

at the top of measure.c makes the program need 11.8 seconds to run – it has become 7% slower by just adding an unused function!

Why we get the performance difference
The only effect of adding the unused function is that some other functions are moved to different addresses², so it is natural to attribute the performance difference to “cache effects”. But there are several other hardware bottlenecks that can cause this kind of performance variability – this particular case is due to branch prediction.

Branch predictors keep track of the history of branches by building various tables in hardware (see Dan Luu’s great blog post). But the branch prediction hardware looks only at a few bits of the addresses, so several branches may map to the same slot, with the result that the branch predictor cannot differentiate between them. Changing the size of one code segment makes other parts of the program move to different addresses, and some branches may be unlucky in that they now alias with branches behaving in a different way, which causes the performance to regress due to excessive mispredictions.
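To make the aliasing mechanism concrete, here is a minimal sketch of such a predictor: a small table of 2-bit saturating counters indexed by a few low bits of the branch address. The table size and the addresses are made up for illustration – real predictors are far more sophisticated – but the mechanism is the same. Two branches whose addresses differ only in bits the table ignores share a counter and destroy each other’s history:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical predictor: 16 two-bit saturating counters indexed by
   bits 2-5 of the branch address.  0-1 predicts "not taken",
   2-3 predicts "taken". */
#define TABLE_SIZE 16
static uint8_t counters[TABLE_SIZE];

static unsigned slot(uintptr_t addr)
{
  return (addr >> 2) & (TABLE_SIZE - 1);
}

static int predict(uintptr_t addr)
{
  return counters[slot(addr)] >= 2;
}

static void update(uintptr_t addr, int taken)
{
  uint8_t *c = &counters[slot(addr)];
  if (taken && *c < 3)
    ++*c;
  else if (!taken && *c > 0)
    --*c;
}

int main(void)
{
  /* Made-up addresses that differ only in bits the table ignores,
     so both branches map to the same counter (slot 5). */
  uintptr_t always_taken = 0x400014;
  uintptr_t never_taken = 0x400114;
  int misses = 0;

  for (int i = 0; i < 1000; i++)
    {
      misses += predict(always_taken) != 1;
      update(always_taken, 1);
      misses += predict(never_taken) != 0;
      update(never_taken, 0);
    }
  printf("mispredictions: %d of 2000\n", misses);
  return 0;
}

In this toy model the always-taken branch is mispredicted every time because the never-taken branch keeps resetting their shared counter, even though each branch in isolation would be trivially predictable. Move either branch to a non-aliasing address (for example 0x400018) and the mispredictions disappear after warm-up.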
Impact of this when optimizing
All books and conference talks about optimization tell you to always measure your change to verify that it actually improves the performance before committing. This is good advice, but just measuring the running time is not enough! For example, if our change to libquantum had been a real improvement that made the code 5% faster, then we would still have seen a 2% regression due to the (unrelated) branch prediction issues. But this regression is just us being unlucky with code placement, and further changes may make us lucky again, so it would probably have been better long-term to do the change. Similarly, I have seen too many bad changes committed because “we measured it, and this made the program faster (but we do not understand why...)”.

So it is often useful to dig deeper and understand why we get the performance result, so that we can attack the root cause and not rely on being lucky – for the libquantum example we could ensure that the branches in the sensitive part of the code are located near each other in memory (for example, by forcing the code to be inlined), or try to eliminate some branches by algorithmic changes.
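A good complement to timing is to compare the CPU’s hardware counters for the two versions, for example with the Linux perf tool (a sketch – the exact event names available depend on the CPU and the perf version):

perf stat -e cycles,instructions,branches,branch-misses ./shor 456 427

If the “slower” version retires roughly the same number of instructions but has many more branch misses, that is a strong hint that the difference comes from code placement rather than from the change being measured.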
Further reading
- “Producing Wrong Data Without Doing Anything Obviously Wrong!” has a more detailed discussion about these issues.
- “Intel 64 and IA-32 Architectures Optimization Reference Manual” describes many of the hardware bottlenecks for Intel CPUs. The principles are (mostly) the same for other CPUs, although the details are different.
1. There are some randomized algorithms in libquantum, so I hardcoded a seed in order to measure the same work in each run:

--- shor.c.orig  2007-09-03 08:47:55.000000000 +0200
+++ shor.c  2018-10-20 21:18:45.026076220 +0200
@@ -37,7 +37,7 @@
   int N;
   int c,q,a,b, factor;
 
-  srand(time(0));
+  srand(123456);
 
   if(argc == 1)
     {
The running time of this is stable (11.0 seconds with a standard deviation of 0.014 seconds).
2. Adding the function could also have made the compiler make different inlining decisions etc., but I have verified that the only difference for this program is that some code is placed at different addresses compared to the original version.
I am definitely guilty of “we measured it, and this made the program faster (but we do not understand why...)”.

Do you know of a good way of displaying relevant metrics that can explain mysterious performance deviations? For example, showing branch prediction hits/misses.

I'm using the Linux perf tool – it reads the CPU’s hardware counters and can annotate the assembly with information about the cost of the different instructions.

The tool is powerful, but is very annoying when you start using it... I would suggest starting by watching Chandler Carruth’s CppCon 2015 talk “Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!” if you have not used it before...
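A typical workflow for this kind of problem (a sketch – the exact commands and output vary between perf versions) is to record samples for the interesting event and then drill down:

perf record -e branch-misses ./shor 456 427
perf report    # shows which functions get the branch misses
perf annotate  # maps the misses to individual instructions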