Sunday, September 25, 2016

The cost of clearing memory

The technical report "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V" looks at how the RISC-V ISA compares to ARM and Intel by analyzing the results from running SPEC CINT2006. One thing that surprised me was that 30% of the instructions executed when running the 403.gcc benchmark (which compiles a few files using a modified GCC 3.2) are from doing memset!

The RISC-V memset loop writes 16 bytes in 4 instructions
// RV64G, 4 instructions to move 16 bytes
4a3814:   sd    a1, 0(a4)
4a3818:   sd    a1, 8(a4)
4a381c:   addi  a4, a4, 16
4a3820:   bltu  a4, a3, 4a3814
which is somewhat less efficient than the ARM and Intel versions, which write more data per instruction
// armv8, 6 instructions to move 64 bytes
6f0928:   stp   x7, x7, [x8,#16]
6f092c:   stp   x7, x7, [x8,#32]
6f0930:   stp   x7, x7, [x8,#48]
6f0934:   stp   x7, x7, [x8,#64]!
6f0938:   subs  x2, x2, #0x40
6f093c:   b.ge  6f0928
so this should translate to about 10% of the executed ARM instructions doing memset (the ARM loop needs roughly 3/8 as many instructions per byte written). But that is still much more than I would have guessed.
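For reference, here is roughly what such an unrolled store loop looks like in C (assuming an 8-byte-aligned buffer, a size that is a multiple of 16, and a zero fill value; a real memset also has to handle unaligned heads and tails and an arbitrary fill byte):

#include <stddef.h>
#include <stdint.h>

/* Zero 'size' bytes at 'p', 16 bytes per loop iteration.  Assumes p is
   8-byte aligned and size is a multiple of 16, like the inner loops above. */
static void zero_loop(uint64_t *p, size_t size)
{
    uint64_t *end = p + size / sizeof(uint64_t);
    while (p < end) {
        p[0] = 0;   /* first 8-byte store (sd, or half an stp) */
        p[1] = 0;   /* second 8-byte store */
        p += 2;     /* advance 16 bytes */
    }
}

int main(void)
{
    uint64_t buf[64];
    zero_loop(buf, sizeof(buf));
    return 0;
}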

I do not have access to SPEC, and I have been too lazy to try to replicate this with other data, but a quick literature search indicates that this is not as insane as I thought. The papers I have found look at the cost of clearing data in garbage collection implementations, and they seem to get a similar result for the cost. For example, "Why Nothing Matters: The Impact of Zeroing" says:
We show that existing approaches of zero initialization are surprisingly expensive. On three modern IA32 architectures, the direct cost is around 2.7-4.5% on average and as much as 12.7% of all cycles, in a high-performance Java Virtual Machine (JVM), without accounting for indirect costs due to cache displacement and memory bandwidth consumption.

Monday, September 12, 2016

Mobile GPUs — power, performance, area

I read an IWOCL 2016 conference paper, "OpenCL-Based Mobile GPGPU Benchmarking: Methods and Challenges", that gives some good advice on how to measure GPU performance on mobile devices. The paper focuses on microbenchmarks, and it is hard to use these to draw relevant conclusions — especially as mobile GPUs are rather different from the desktop GPUs that most developers are used to. Below are some of the things that were unclear to me when I started working on the shader compiler for a mobile GPU.

Power consumption and heat

The major design constraint for mobile devices is power consumption, as it is hard to get rid of heat without large heatsinks and fans. The amount of heat that can be handled varies a lot between devices depending on their size and how they are built, but it corresponds to somewhere between 1W for a low-end phone and 7W for a tablet — and that includes power for CPU, GPU, memory, radio, etc.

It takes a while for the temperature to rise, so the hardware may temporarily run hotter (and faster), but the device will eventually throttle the performance if it runs too hot. This means that benchmarking results are not that interesting unless you know how long the performance can be sustained, and measurements of peak performance mostly tell you how over-dimensioned the hardware is.

I find it amusing that the discussion in the paper suggests that you want high performance numbers
[...] benchmarks with long running time may trigger thermal gating in certain cases, resulting in lower performance result.
I usually want realistic results, but wanting high numbers makes sense if you (like the authors) work at a hardware vendor and want a high number in the marketing material. They also suggest
Running the benchmark in a temperature-controlled environment is one option; if such an option is not available, adding idle periods between workloads may reduce chances of high system temperature. 
which is especially important if your SoC has a heat problem. 😏
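As a concrete (if simplistic) example, the loop below runs a workload, pauses, and repeats; run_workload() is a made-up placeholder for launching one benchmark kernel and waiting for it, and the 5 second pause is an arbitrary value:

#include <stdio.h>
#include <unistd.h>

/* Placeholder for one benchmark iteration; a real harness would enqueue
   the OpenCL kernels here and wait for them to finish. */
static void run_workload(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)
        x += 1.0;
}

int main(void)
{
    for (int i = 0; i < 5; i++) {
        run_workload();
        printf("iteration %d done\n", i);
        /* Idle period so the heat from this iteration does not throttle
           the next one; the length is an arbitrary example value. */
        sleep(5);
    }
    return 0;
}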

Dynamic Voltage and Frequency Scaling

Power consumption increases roughly quadratically with clock frequency[1] — for example, raising the frequency from 600MHz to 700MHz on Exynos 5433 increases power consumption by 42%. This means it is better to lower the clock frequency and keep the GPU running 100% of the time instead of, for example, running it at full speed 75% of the time and being idle the remaining 25%.
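A back-of-the-envelope argument using the formula from the footnote: for a fixed amount of work of \(W\) clock cycles, the GPU is busy for \(W/f\) seconds, so the dynamic energy is \(E = P_{dyn} \cdot W/f = CV^{2}W\), which does not depend on the frequency at all. Racing through the work at full speed and then idling therefore saves nothing, while running at a lower frequency allows a lower voltage and shrinks \(E\) by the factor \(V_{low}^{2}/V_{high}^{2}\). (This ignores static leakage and the cost of keeping the rest of the system awake longer, so it is only a rough argument.)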

This performance tuning is done by Dynamic Voltage and Frequency Scaling (DVFS). It is hard to make a good implementation, as different applications have different requirements and there is no "correct" tradeoff. For example, should a high-end game run at full speed and be throttled (i.e. reduced to an unplayable frame rate) when the device overheats, or should it run at a lower, sustainable speed from the beginning? Different device vendors implement DVFS in different ways, so two phones with the same GPU may behave differently.

Different operations need different amounts of power, and a good DVFS implementation takes this into account when adjusting the voltage and frequency. For example, memory operations consume much more power than arithmetic operations, and Exynos uses this to run shaders that do many memory operations at a lower voltage/frequency. This is "fun" when optimizing shaders, as a shader that is faster in number of clock cycles does not necessarily run faster in reality if it uses more power-hungry instructions and thus gets a lower clock frequency.

Power- and area-efficiency

GPU workloads are embarrassingly parallel, so it is easy to double the performance of the hardware if you are allowed to increase the power and chip area — just place two identical GPUs in the package! In the same way, you can get much better power efficiency by using two GPUs and running them at half the frequency. This means that you need to look at metrics such as performance per area and power when comparing GPU architectures.
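As a rough illustration (using the \(P_{dyn}=CV^{2}f\) formula from the footnote, and assuming for the sake of the example that halving the frequency lets the voltage drop by 20%): one GPU at frequency \(f\) and voltage \(V\) delivers some throughput at power \(CV^{2}f\), while two GPUs at \(f/2\) and \(0.8V\) deliver the same throughput at \(2 \cdot C(0.8V)^{2}(f/2) = 0.64\,CV^{2}f\), i.e. about two thirds of the power, at the cost of twice the area.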

This is annoying when developing GPUs, as most ideas for improving the performance mean that some hardware block becomes more complicated, with the result that it grows in size and consumes more power. And it does not make much sense to make the GPU 10% faster if it also needs 10% more area and power...


1. The dynamic power consumption is actually \(P_{dyn}=CV^{2}f\), where \(C\) is capacitance, \(V\) is voltage, and \(f\) is frequency. At a fixed voltage this varies linearly with the frequency, but a higher frequency needs a higher voltage, so the power consumption varies superlinearly with the frequency.