I read an IWOCL 2016 conference paper "OpenCL-Based Mobile GPGPU Benchmarking: Methods and Challenges" that gives some good advice on how to measure GPU performance on mobile devices. The paper focuses on micro benchmarks, and it is hard to use these to draw relevant conclusions — especially as mobile GPUs are rather different compared to the desktop GPUs that most developers are used to. Below are some of the things that were unclear to me when I started working on the shader compiler for a mobile GPU.
Power consumption and heat
The major design constraint for mobile devices is power consumption as it is hard to get rid of heat without large heatsinks and fans. The amount of heat that can be handled varies a lot between devices depending on the size and how they are built, but it corresponds to somewhere between 1W for a low-end phone and 7W for a tablet — and that includes power for CPU, GPU, memory, radio, etc.

It takes a while for the temperature to rise, so the hardware may temporarily run hotter (and faster), but the device will eventually throttle the performance if it is running too hot. This means that benchmarking results are not that interesting unless you know how long that performance can be sustained, and measurements such as peak performance tend to say just how over-dimensioned the hardware is.
I find it amusing that the discussion in the paper suggests that you want high performance numbers
[...] benchmarks with long running time may trigger thermal gating in certain cases, resulting in lower performance result.

I usually want realistic results, but wanting high numbers makes sense if you (as the authors) work at a hardware vendor and want a high number in the marketing material. They also suggest
Running the benchmark in a temperature-controlled environment is one option; if such an option is not available, adding idle periods between workloads may reduce chances of high system temperature.

which is especially important if your SoC has a heat problem. 😏
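If you want numbers that mean something for real applications, it helps to log how the performance develops over time instead of just recording the best result. Below is a minimal sketch of that idea in C: run the same workload many times, print the time for each iteration, and optionally sleep between iterations as the paper suggests. The run_workload() function is a hypothetical placeholder for whatever OpenCL work you are actually benchmarking.

```c
/* Sketch: sustained vs. peak performance measurement. */
#define _POSIX_C_SOURCE 199309L

#include <stdio.h>
#include <time.h>

/* Hypothetical placeholder: replace with the real OpenCL workload,
 * e.g. clEnqueueNDRangeKernel(...) followed by clFinish(queue). */
static void run_workload(void)
{
}

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

int main(void)
{
    const int  iterations = 300;  /* run long enough to reach thermal throttling */
    const long idle_ns    = 0;    /* > 0 adds an idle period between workloads    */

    for (int i = 0; i < iterations; i++) {
        double start = now_seconds();
        run_workload();
        double elapsed = now_seconds() - start;

        /* The first iterations show "peak" performance; the tail shows
         * what the device can actually sustain. */
        printf("%d %.6f\n", i, elapsed);

        if (idle_ns > 0) {
            struct timespec idle = { .tv_sec = 0, .tv_nsec = idle_ns };
            nanosleep(&idle, NULL);
        }
    }
    return 0;
}
```

Plotting the per-iteration times shows both the initial peak performance and the level the device can sustain once thermal throttling kicks in.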
Dynamic Voltage and Frequency Scaling
Power consumption increases roughly quadratically with clock frequency¹ — for example, raising the frequency from 600MHz to 700MHz on Exynos 5433 increases power consumption by 42%. This means it is better to lower the clock frequency and keep the GPU running 100% of the time instead of, for example, running it at full speed 75% of the time and being idle the remaining 25%.

This performance tuning is done by Dynamic Voltage and Frequency Scaling (DVFS). It is hard to make a good implementation as different applications have different requirements, and there is no "correct" tradeoff. For example, should a high-end game run at full speed, and be throttled (i.e. reduced to an unplayable frame rate) when the device overheats, or should it run at a lower, sustainable speed from the beginning? Different device vendors implement DVFS in different ways, so two phones with the same GPU may behave differently.
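A back-of-the-envelope check of the 75%/100% example above, using the dynamic power formula from the footnote and assuming (purely for illustration) that the voltage can be lowered in proportion to the frequency:

\[
E_{\text{full speed}} = CV^{2}f \cdot 0.75\,t = 0.75\,CV^{2}ft, \qquad
E_{\text{lowered}} = C\,(0.75V)^{2}\,(0.75f) \cdot t \approx 0.42\,CV^{2}ft
\]

Both cases execute the same number of cycles (\(0.75\,ft\)), but the lowered-frequency case uses roughly half the energy. Real voltage/frequency curves are of course not this simple, and idle power is ignored, so treat the numbers as an illustration only.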
Different operations need different amounts of power, and a good DVFS implementation takes this into account when adjusting the voltage and frequency. For example, memory operations consume much more power than arithmetic operations, and Exynos uses this to run shaders with more memory operations at a lower voltage/frequency. This is "fun" when optimizing shaders, as a faster shader (as measured in number of clock cycles) does not necessarily run faster in reality if it uses more power-hungry instructions and thus gets a lower clock frequency.
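As a contrived illustration (not from the paper or any real driver), consider two hypothetical OpenCL kernels that compute the same result: one reads a precomputed factor from a lookup table, the other recomputes it with a few extra arithmetic instructions. The table version may need fewer clock cycles, but its extra memory traffic could push a DVFS policy like the one described above towards a lower voltage/frequency, so it is not guaranteed to win in wall-clock time.

```c
/* Contrived example: two kernels producing the same output.
 * Assumes lut[k] has been precomputed as 1.0f + 0.5f * (k / 255.0f)^2. */

/* Version A: fewer ALU instructions, but one extra global memory read. */
__kernel void scale_lut(__global const float *in,
                        __global const float *lut,
                        __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * lut[i % 256];
}

/* Version B: recomputes the factor instead of reading it from memory.
 * More clock cycles, but less memory traffic, which may let DVFS keep
 * the GPU at a higher voltage/frequency. */
__kernel void scale_recompute(__global const float *in,
                              __global float *out)
{
    size_t i = get_global_id(0);
    float x = (float)(i % 256) * (1.0f / 255.0f);
    out[i] = in[i] * (1.0f + 0.5f * x * x);
}
```

Which version wins depends on the memory system and on how the particular device's DVFS policy reacts, so the only way to know is to measure on the target hardware.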
Power- and area-efficiency
GPU workloads are embarrassingly parallel, so it is easy to double the performance of the hardware if you are allowed to increase the power and chip area — just place two identical GPUs in the package! In the same way, you can get much improved power efficiency by using two GPUs and running them at halved frequency. This means that you need to look at metrics such as "performance per area and power" when comparing GPU architectures.

This is annoying when developing GPUs, as most ideas for improving the performance mean that some hardware block becomes more complicated, with the result that the size increases and it consumes more power. And it does not make much sense to make the GPU 10% faster if it also needs 10% more area and power...
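The footnote formula gives a rough feel for the efficiency gain of the two-GPU case, assuming (again purely for illustration) that the halved frequency can run at a lower voltage \(V' < V\):

\[
P_{\text{1 GPU}} = CV^{2}f, \qquad
P_{\text{2 GPUs}} = 2 \cdot C\,V'^{\,2}\,\frac{f}{2} = C\,V'^{\,2}f
\]

Both configurations deliver the same throughput (\(2 \cdot f/2 = f\)), so the power saving is the factor \((V'/V)^{2}\). If the halved frequency can run at, say, 70% of the voltage, the two-GPU configuration needs about half the power for the same performance, at the cost of twice the area.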
1. The dynamic power consumption is actually \(P_{dyn}=CV^{2}f\) where \(C\) is capacitance, \(V\) is voltage, and \(f\) is frequency. This varies linearly with the frequency, but an increased frequency needs a higher voltage, and the power consumption thus varies superlinearly with the frequency.
Good discussion. But you may not get the point of microbenchmarks. Microbenchmarking is not about getting high numbers; rather, one wants to see the device's full potential. It is extremely important since in a lot of cases you never know the real-world application, so you won't know when throttling will kick in. In such cases, measuring the device's peak performance could help you understand the device. A microbenchmark basically gives you the upper bound of the device's performance.
And also for SoC vendors, microbenchmarks can be used to guide the chip design and driver/compiler design. For instance, on a real device you may never achieve the performance from the spec due to some design bugs or unoptimized code. Microbenchmarks in this case can help identify the problems. That being said, you need to make sure the number you measure is the best you can get from the device, not some number with DVFS enabled.
I agree that micro-benchmarking is an important tool when designing the GPU and compiler/drivers, but I would assume that those benchmarks are mostly run on FPGAs (or in RTL simulations) – at least that is how I did it when I worked on a GPU compiler...
My issue with the paper is that it talks about "profile and compare performance across different devices" where on mobile GPUs you have a "lack of underlying hardware details", and to use that for guiding application development...
The blog post is mostly focusing on "compare performance across different devices" – the technology press uses GFXBench's micro benchmarks in order to tell which GPU is best, which is meaningless due to the issues mentioned in the blog.
But I still think that micro-benchmarking has very little relevance for application developers too – partly due to the power considerations, and partly because they do not know the details of the hardware.

For example, one of the most expensive parts in hardware is moving data around, while arithmetic is relatively cheap, so I'd argue that a good hardware architecture has more ALU capacity than can be utilized in normal code. (This is not specific to GPUs. For example, the Intel Nehalem microarchitecture can execute 4 micro-ops each cycle, but the register file has only three read ports, so it cannot feed the ALUs unless some data can be taken from other hardware units.) So a micro benchmark will give very different results depending on how data flows between instructions (and this is very different for different architectures, so you cannot do any "fair" benchmarking). And let's say your application reaches 80% of this. You cannot say if that is good or bad – it may be the expected peak utilization for normal code.