Saturday, August 11, 2018

Inlining and microbenchmarking

Inlining is important for C++ performance, but the compiler must be careful not to increase the code size too much. GCC’s inlining process essentially inlines functions in priority order until a growth percentage limit is hit (or until all relevant functions have been inlined). This works fine for large translation units, where the parts of the code that do not benefit from inlining can compensate for the size increase in the parts where inlining helps. But it works less well for small translation units that need much inlining, which is common when doing microbenchmarks.
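GCC exposes the growth limit as a tunable parameter, and `-Winline` warns when a function declared `inline` could not be inlined. A sketch of how one might inspect or tweak the decisions (the file name is hypothetical, and the parameter value is only illustrative; the default differs between GCC versions):

```shell
# Warn about functions declared inline that GCC decides not to inline,
# and set the percentage by which inlining may grow the translation unit.
# (40 is an illustrative value, not a recommendation; bench.cpp is a
# placeholder for your benchmark file.)
g++ -O2 -Winline --param inline-unit-growth=40 -c bench.cpp
```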

Take for example this quick-bench benchmark from Bartlomiej Filipek’s blog post “Speeding Up string_view String Split Implementation” that measures the performance for different ways of splitting a string.

We can see that StringSplitStd is about 9% faster than StringViewSplit (their scores are 7829 and 7189), but the reason for this difference is that GCC has inlined everything for StringSplitStd but not for StringViewSplit.

We get a different result if we run the benchmark with a different set of functions. For example, removing StringViewSplitStd and StringViewSplitPtr from the benchmark causes the compiler to make different inlining decisions, and we get the same performance for both StringSplitStd and StringViewSplit (quick-bench).

When doing microbenchmarking, it is a good idea to check that the compiler makes the same inlining decisions as it does when the code is used in a real use case (that is, in a realistically sized translation unit).

1 comment:

  1. In what circumstances would overall growth percentage be effective at predicting whether in-lining would be beneficial?

If a function is called a huge number of times from within a compact loop, it may be expanded almost to the size of the CPU's code cache without any cache-related performance degradation. If, however, it is called once for each iteration of a loop that would barely fit in the CPU's code cache, expanding it to such a size could result in the CPU having to reload it into cache on every iteration, and then having to reload whatever it had displaced. If both the function and the calling code happen to be in the same compilation unit, the former situation might cause 100:1 expansion of the code and yet be good from an optimization perspective, while the latter might show up as only a 2:1 expansion and yet be absolutely disastrous.

Perhaps it might be helpful to have an option to produce two extra externally-callable versions of a function: one intended to minimize cache usage, and the other optimized on the presumption that it could afford to use the majority of the code cache. Client code (which might be in a separate translation unit) would then call whichever function would be more appropriate for a given scenario, while weak symbols equate the alternate versions to a normal one (so that if only one version is defined, client code would use it, but if multiple versions are defined, the client would use whichever was most appropriate).

    Such a design could also facilitate microbenchmarking if small benchmarking loops could explicitly specify which version of the function they wanted to call. What do you think?