## Tuesday, May 12, 2015

### Optimizing ESSL

The cartoon understanding of compiler design is that compilers consist of three parts:
• front end — handling everything that is language specific
• middle end — language- and hardware-independent optimizations
• back end — code generation, independent of the language
One point I was trying to make in my two previous posts is that the situation is more complex in reality; the backend may take advantage of the ESSL precision qualifiers during instruction selection/scheduling, and this affects what the optimizations are allowed to do. So you cannot use a language/hardware-independent middle end if you have a sufficiently strange architecture and want to take advantage of the latitude ESSL gives you.

There are many other considerations when writing a high-performance compiler for some specific market/language/hardware architecture that may be surprising if you have not worked in that area. I'll  give some examples below that have surprised me over the years.

### Performance, power, and performance measurement

Mobile devices are power constrained, so the clock frequency is dynamically managed to prevent the GPU from running too hot. Different operations consume a different amount of power, and it is not obvious that the fastest shader measured in "number of cycles" is the fastest in "running time", as a slower shader using less power-hungry instructions may be run at a higher clock frequency. So the cycle count may deceive you when you are optimizing shaders.

It is actually very hard to get meaningful performance data when evaluating optimizations (on all systems — not only GPUs), and just implementing an optimization and observing the difference in run time may not tell you if the optimization is beneficial or not. My favorite paper on this is "Producing Wrong Data Without Doing Anything Obviously Wrong!" by Mytkowicz et al. that show that performance of real world applications depend surprisingly much on luck in things like alignment and cache effects. For example, changing the order of files when linking gives up to 15% performance variance for applications in the SPEC CPU2006 benchmark suite. And the result is different for different environments, so you may see a healthy 5% performance uplift in your environment, while the change is actually harmful and makes it slower for most other environments. I have seen many optimization results that I believe are due to this rather than any real improvement...

### Compilation speed

High end mobile games may have hundreds of shaders, and shader compilation is done at application start up, so it is important that the compiler is fast. This means that the optimization strategy should be different compared to a desktop compiler, as you need to be more careful in the tradeoff between optimization run time and potential benefit, and not slow down the compiler by handling cases that are unlikely to happen in real world shaders.

Mobile CPUs have improved a lot the last couple of years, but they are still lagging the desktop when it comes to out-of-order execution etc. This makes the abstraction penalty more painful on mobile processors, and you may want to take that into account when designing an ESSL compiler.

### Optimizations

Desktop compilers are insanely complex, but most of that complexity deals with things that does not happen in shaders; ESSL does not have pointers, so data tracking and aliasing analysis is easy. Shaders does not work on large arrays, so you do not need to transform loops to get better memory accesses pattern. Vectorization is essentially software based warping, so that does not help warp based GPUs. Etc. etc.

And shaders are by necessity small — all mobile phones have high resolution screens, and you cannot spend that many cycles on each pixel if you want a decent frame rate.1 There are not much opportunity for optimizations in small pieces of code, so the relevant optimizations are essentially what you had in an early 90's desktop compiler: inlining, simple loop unrolling, if-conversion, etc.

An important part of compiler development, that is usually glossed over in the compiler literature, is implementing peephole optimizations that maps common code idioms to efficient instruction sequences. Application developers keep inventing strange code constructs, so this is a work package that is never finished. To take a random example from GCC: WebKit implements arithmetic right shift by 4 bits using the idiom
r = (v & ~15) / 16;
so GCC needed to add a rule to recognize as an "arithmetic shift right" instruction. A big part of creating a good compiler is to handle "all" such cases, and graphical shaders have different constructs compared to typical C/C++ code, so you need to invest lots of time looking at real world shaders.

1 For example, 500MHz, 30fps, 1920x1080 translates to 8 cycles/pixel. Most GPUs have multiple cores (or whatever they are called — all GPU vendors have different terminology), so the cycle budget is larger for most devices. But still rather limited.