Sunday, June 14, 2015

Proving correctness of the GCC match.pd — Part 2

The previous post talked about the background and the GCC GIMPLE IR. This post continues with a look at the format of match.pd.

match.pd contains transformation rules for the IR, written in a "lispy" domain-specific language. Each transformation rule is specified by a simplify expression that has two operands: a template expression that is matched against the IR, and a replacement expression that replaces the matched expression in the IR. For example
/* x & ~0 -> x */
(simplify
  (bit_and @0 integer_all_onesp)
  @0)
matches expressions corresponding to "x&~0" and replaces them with "x".

A template operand of the form @<number> is called a "capture" — it captures the operand so you can refer to it in other parts of the simplify expression. Using the same capture multiple times in a template means that the operands need to be identical, e.g.
/* x & x -> x */
(simplify
  (bit_and @0 @0)
  @0)
matches a bit_and with two identical operands.

The predicates, such as integer_all_onesp in the first example above, are the normal predicates used in the GCC middle-end. It is possible to capture operators and predicates too, so you may write
/* x | ~0 -> ~0 */
(simplify
  (bit_ior @0 integer_all_onesp@1)
  @1)
for the rule that changes "x|~0" to "~0".

Pattern matching for commutative operators may use the :c decoration to match the operands in any order. For example
/* x | ~0 -> ~0 */
(simplify
  (bit_ior:c @0 integer_all_onesp@1)
  @1)
will match both "x|~0" and "~0|x" (although it does not matter in this case, as the GCC IR always has the constant as its second argument for commutative operations).

There are cases where you cannot express the replacement using this syntax, so it is possible to generate the replacement expression using C++ code by placing it between curly braces
/* x ^ x -> 0 */
(simplify
  (bit_xor @0 @0)
  { build_zero_cst (type); })
The type of the outermost matching expression (i.e. the type of the bit_xor) is available in the variable type.

It is often the case that additional conditions need to be fulfilled in order to permit the replacement. This can be handled with an if expression:
/* x / -1 -> -x */
(simplify
  (exact_div @0 integer_minus_onep)
  (if (!TYPE_UNSIGNED (type))
   (negate @0)))
This replaces the original expression only if the type of the outermost expression is signed. The operand to if is a C-style expression, so it is possible to create complex conditions, such as
(if (!HONOR_SNANS (element_mode (type))
     && (!HONOR_SIGNED_ZEROS (element_mode (type))
         || !COMPLEX_FLOAT_TYPE_P (type)))
Many rules are identical for several operations, and you can handle this with a for expression
/* x / -1 -> -x */
(for div (trunc_div ceil_div floor_div round_div exact_div)
 (simplify
   (div @0 integer_minus_onep)
   (if (!TYPE_UNSIGNED (type))
    (negate @0))))
The if and for can be nested arbitrarily, and you can use them to construct complex rules where multiple simplifications are done within the same for or if expression.
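As a sketch of what this can look like (the rules below are just for illustration — they are not taken verbatim from match.pd), one for expression may contain several simplify expressions, each with its own if:
/* x / 1 -> x,  x / -1 -> -x */
(for div (trunc_div ceil_div floor_div round_div exact_div)
 (simplify
   (div @0 integer_onep)
   @0)
 (simplify
   (div @0 integer_minus_onep)
   (if (!TYPE_UNSIGNED (type))
    (negate @0))))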

Finally, there is some additional functionality, such as marking parts of a template as optional, but I'll leave that until I have made some progress with the functionality described so far...

The next blog post will look at actually proving things.

Sunday, June 7, 2015

API for manipulating and optimizing SPIR-V binaries

I have cleaned up the API in the spirv-tools module I wrote about in a previous blog post. There is currently functionality for
  • Reading/writing SPIR-V binaries and my high level assembler
  • Iterating over instructions, functions, and basic blocks
  • Adding/removing/examining instructions, functions, and basic blocks
  • Some optimization passes (dead code elimination and CFG simplification)
The idea behind the API is that it should correspond directly to the SPIR-V binary; the binary is conceptually represented as a list of SPIR-V instructions by the Module class, and each Instruction consists of the operation name, result ID, type ID, and operands. Iterating over the module returns instructions in the same order as in the binary. The API also has concepts of functions and basic blocks, and the Function and BasicBlock classes encapsulate sub-sequences of the binary's instructions.

Reading and examining binaries can be done with a minimum of code. For example, here is how to read a SPIR-V binary and count the number of load instructions:
#!/usr/bin/env python
import read_spirv

with open('frag.spv', 'rb') as stream:
    module = read_spirv.read_module(stream)

nof_loads = 0
for inst in module.instructions():
    if inst.op_name == 'OpLoad':
        nof_loads += 1

print 'Number of load instructions: ' + str(nof_loads)

The main use case for the API is analyzing binaries, test generation, etc., but it is useful for implementing optimizations too. For example, a simple peephole optimization for transforming integer "-(-x)" to "x" can be written as
for inst in module.instructions():
    if inst.op_name == 'OpSNegate':
        op_inst = module.id_to_inst[inst.operands[0]]
        if op_inst.op_name == 'OpSNegate':
            src_inst = module.id_to_inst[op_inst.operands[0]]
            inst.replace_uses_with(src_inst)
For each instruction we check whether it is an OpSNegate instruction; if so, we look at the instruction defining its operand. If that too is an OpSNegate, then we replace all uses of the original instruction with the instruction that defines the second OpSNegate's operand. This leaves the original OpSNegate dead, so you probably want to run the dead code elimination pass after this, which you do as
dead_code_elim.optimize(module)

A slightly more involved example is optimizing integer "x+x" to "x<<1":
for inst in module.instructions():
    if (inst.op_name == 'OpIAdd' and
           inst.operands[0] == inst.operands[1]):
        const1 = module.get_constant(inst.type_id, 1)
        sll = ir.Instruction(module, 'OpShiftLeftLogical',
                             module.new_id(), inst.type_id,
                             [inst.operands[0], const1.result_id])
        sll.copy_decorations(inst)
        inst.replace_with(sll)
Here we need to replace the OpIAdd instruction with a newly created OpShiftLeftLogical. One complication is that SPIR-V specifies the type partially by the PrecisionLow, PrecisionMedium, and PrecisionHigh decorations, so we need to copy the decorations each time we create a new instruction; failing to do this may give a dramatic performance reduction on some architectures. I still think that SPIR-V should change how the type and precision modifiers are handled for graphical shaders...

module.get_constant returns an OpConstant or OpConstantComposite instruction with the specified type and value. For vector types the value should in general have the same number of elements as the type, but it is also allowed to pass a scalar value, which is then replicated over the vector width.
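For example, both of the lines below describe the same four-element constant, assuming vec4_type_id is the <id> of an existing 4-element floating point vector type (the name is made up for this sketch):
ones = module.get_constant(vec4_type_id, [1.0, 1.0, 1.0, 1.0])
ones = module.get_constant(vec4_type_id, 1.0)  # the scalar is replicated over the vector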

inst.replace_with(sll) replaces inst with sll in the basic block, updates all uses of inst to use sll, and destroys inst. It is safe to add/remove instructions while iterating. The only possible issue is that the iterator may not see instructions that are inserted in the current basic block, and you may see an instruction in its old place if it is moved within the current basic block. The predecessors/successors are however always correctly updated — it is only the iteration order that may be surprising.


Basic blocks are handled in a similar way to instructions. See the simplify_cfg pass for an example of how they are used in reality.


There is still some functionality missing (see the TODO file) that I'm planning to fix eventually. Please let me know if you need some specific functionality, and I'll bump its priority.

Sunday, May 31, 2015

Running the GCC test-suite for epiphany-sim

I wanted to run the GCC test-suite on Adapteva’s Epiphany architecture, but I could not find much useful information on how to do it. This post documents what I eventually managed to get running.

The GCC "simtest howto" (having examples/results from 2003 — I'll send a patch to update it...) suggests using a "combined tree" where the source code from GCC, binutils, GDB, and newlib are merged. I'd like to avoid this, as I want to be able to test with different revisions of the components, and I do not trust that I will get reproducible results with the combined tree (for example, both binutils and GDB includes libbfd, and I want to ensure that binutils is built with the correct version).

The instructions below build everything separately, using the latest released versions. It is assumed that DIST contains the path to the source code packages, and that PREFIX is the path where the resulting toolchain will be installed.

Building binutils

Binutils is built as
tar zxf ${DIST}/binutils-2.25.tar.gz
mkdir build_binutils && cd build_binutils
../binutils-2.25/configure --prefix=${PREFIX} --target=epiphany-elf
make -j4
make install
cd ..

Building GCC

GCC needs support from GMP, MPFR, etc. These can be handled using shared libraries, but I want to make sure I know which versions are used. The easiest way of handling this is to place the libraries' source code within the GCC source tree, which builds them as a part of GCC.
tar zxf ${DIST}/gcc-5.1.0.tar.gz
tar zxf ${DIST}/gmp-6.0.0a.tar.bz2 
mv gmp-6.0.0 gcc-5.1.0/gmp
tar zxf ${DIST}/mpc-1.0.3.tar.gz 
mv mpc-1.0.3 gcc-5.1.0/mpc
tar zxf ${DIST}/mpfr-3.1.2.tar.gz 
mv mpfr-3.1.2 gcc-5.1.0/mpfr
tar zxf ${DIST}/isl-0.14.tar.bz2 
mv isl-0.14 gcc-5.1.0/isl
The GCC source tree has a script contrib/download_prerequisites that downloads and extracts the correct versions of GMP etc.
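If you prefer that route, the script is run from the top level of the GCC source tree before configuring:
cd gcc-5.1.0
./contrib/download_prerequisites
cd ..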

We cannot build GCC before we have a full environment with newlib, but GCC is needed in order to build newlib. We therefore start by building a somewhat limited version of GCC that can be used to build the library.
mkdir build_gcc_tmp && cd build_gcc_tmp
../gcc-5.1.0/configure --prefix=${PREFIX} --target=epiphany-elf \
    --enable-languages="c" --with-newlib --without-headers
make -j4 all-gcc
make install-gcc
cd ..

Building newlib

Newlib can now be built as
tar zxf ${DIST}/newlib-2.2.0.tar.gz
mkdir build_newlib && cd build_newlib
env PATH="${PREFIX}/bin:${PATH}" \
    ../newlib-2.2.0/configure --prefix=${PREFIX} --target=epiphany-elf
env PATH="${PREFIX}/bin:${PATH}" make -j4 all
env PATH="${PREFIX}/bin:${PATH}" make install
cd ..

Building GCC again

The "real" GCC is built as
mkdir build_gcc && cd build_gcc
../gcc-5.1.0/configure --prefix=${PREFIX} --target=epiphany-elf \
    --enable-languages="c,c++" --with-newlib
make -j4
make install
cd ..

Building the simulator

The testing is done by running the compiled code on a simulator that is built as a part of GDB, but the GNU GDB distribution does not have support for Epiphany. We therefore use the epiphany-gdb-7.8 branch from https://github.com/adapteva/epiphany-binutils-gdb. This repository contains both GDB and some random version of binutils, but we only need the simulator:
unzip ${DIST}/epiphany-binutils-gdb-epiphany-gdb-7.8.zip
mkdir build_sim && cd build_sim
../epiphany-binutils-gdb-epiphany-gdb-7.8/configure \
    --prefix=${PREFIX} --target=epiphany-elf
make -j4 all-sim
make install-sim
cd ..

Running the GCC test-suite

Dejagnu has configuration files for running tests on simulators for most hardware architectures, but not for Epiphany, so we need to create a configuration file epiphany-sim.exp. I'm using the following, which is a modified version of arm-sim.exp:
# Load the generic configuration for this board. This will define a basic
# set of routines used to communicate with the board.
load_generic_config "sim"

# No multilib flags needed by default.
process_multilib_options ""

# basic-sim.exp is a basic description for the standard Cygnus simulator.
load_base_board_description "basic-sim"

# The name of the directory in the build tree where the simulator lives.
setup_sim epiphany

# The compiler used to build for this board. This has *nothing* to do
# with what compiler is tested if we're testing gcc.
set_board_info compiler "[find_gcc]"

# The basic set of flags needed to build "hello world" for this
# board. This board uses libgloss and newlib.
set_board_info cflags   "[libgloss_include_flags] [newlib_include_flags]"
set_board_info ldflags  "[libgloss_link_flags] [newlib_link_flags]"

# This board doesn't use a linker script.
set_board_info ldscript ""

# No support for signals.
set_board_info gdb,nosignals 1
This file needs to be added to dejagnu's search path through a global configuration file. But you do not actually need to add the path in that file — dejagnu automatically searches a boards directory located next to the configuration file. So it is enough to create an empty file ~/dejagnu/config.exp and copy epiphany-sim.exp to ~/dejagnu/boards/epiphany-sim.exp.
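That is, assuming epiphany-sim.exp is in the current directory:
mkdir -p ~/dejagnu/boards
touch ~/dejagnu/config.exp
cp epiphany-sim.exp ~/dejagnu/boards/epiphany-sim.exp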

The GCC test-suite can now be run as
cd build_gcc
env PATH="${PREFIX}/bin:${PATH}" DEJAGNU="~/dejagnu/config.exp" \
    make -j4 check-gcc RUNTESTFLAGS="--target_board=epiphany-sim"


This post was updated 2017-08-13 with a note about contrib/download_prerequisites.

Sunday, May 24, 2015

Human friendly SPIR-V textual representation

SPIR-V is a binary IL that is not meant to be written by humans. But there are many cases where it is desirable to write/modify IL, so I have defined a textual representation that I believe is more convenient to work with than the raw disassembly format used in the SPIR-V specification.

I have chosen to use an LLVM-like representation, as I'm used to that format. A typical instruction is written as
%58 = OpIAdd s32 %57, %32
Constants may be written directly as operands to the instructions. For example, if %32 is a constant
%32 = OpConstant s32 1
then the instruction %58 above can be written as
%58 = OpIAdd s32 %57, 1
In the same way, decorations may be attached directly to the instructions instead of having separate decoration instructions at the top of the file. For example
OpDecorate %56, PrecisionMedium
%56 = OpFMul <4 x f32> %55, %54
can be written as
%56 = OpFMul PrecisionMedium <4 x f32> %55, %54
Names can be used instead of the <id> number
%tmp = OpLoad <4 x f32> %21
%56 = OpFMul <4 x f32> %tmp, %54
This makes the assembler allocate a numerical <id> and add debug information with the name. In general, you do not need to specify things that the assembler can generate by itself, such as the constant and decoration instructions above, or the CFG — the assembler reorders the basic blocks when needed.

The SPIR-V format spreads some information over several instructions in different parts of the binary. This textual representation allows collecting those into one statement, so global variables may be written as
@gl_VertexID = Input s32 PrecisionHigh BuiltIn(5) NoStaticUse
which generates instructions
OpName %16, "gl_VertexID"
OpDecorate %16, PrecisionHigh
OpDecorate %16, BuiltIn, 5
OpDecorate %16, NoStaticUse
%15 = OpTypePointer Input, s32
%16 = OpVariable %15 Input
and function definitions can in a similar way be written as
define <4 x f32> @foo(<4 x f32> %a) {
  ...
}
instead of
OpName %12, "foo"
OpName %11, "a"
%10 = OpTypeFunction <4 x f32>, <4 x f32>
%12 = OpFunction %8 0, %10
%11 = OpFunctionParameter <4 x f32>
  ...
OpFunctionEnd

As an example of what this looks like, I have disassembled a shader using my format. The shader is the same as the one used in the raw disassembly example in the SPIR-V specification.


An assembler/disassembler implementing most of the above is available in my spirv-tools github repository. The disassembler tries to take advantage of the syntactic sugar by default, which has the drawback that you do not have full control over the <id> numbering etc., and you will in general get a different binary if you re-assemble the shader. But there is a command line option -r to disable this and output instructions exactly as in the binary, which is useful if you want to e.g. modify the code to trigger some special case in your compiler.

The implementation is rather rough right now, so it may not work on your favorite SPIR-V binary. But I'll spend some more time on this over the coming weeks (I plan to formalize and document the syntax, and fix the issues mentioned in the TODO file), so I expect to have a working assembler/disassembler well before the first Vulkan driver is available... :)

Sunday, May 17, 2015

Out of memory handling

I watched a video from CppCon 2014 where the speaker said during Q&A
[...] if you are on Linux, you know, malloc is never going to return NULL. It's always going to give you a chunk of memory, even if memory is full. It's going to say "I can get it from somewhere at some point", and if you actually runs out of memory, what happens is that the OS kills you.
I hear this a lot — there is no need to handle out of memory conditions as you'll never get NULL from malloc, and the OS will kill your process anyway. But it is wrong; there are at least two cases where malloc will return NULL on Linux:
  • Per-process memory limits are configured, and the process is exceeding those.
  • A 32-bit application running under a 64-bit kernel is trying to use more than about 4 gigabytes of memory.
So you need to deal with malloc returning NULL.
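A quick way of seeing the first case on a normal Linux system is to restrict the address space of the process. The sketch below (illustration only; the 64 MB limit is an arbitrary choice) does the same thing as a configured per-process limit, and the large malloc then returns NULL instead of the process being killed later:
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void)
{
    /* Limit this process' address space to 64 MB, similar to what a
       configured per-process memory limit would do. */
    struct rlimit lim = { 64 * 1024 * 1024, 64 * 1024 * 1024 };
    if (setrlimit(RLIMIT_AS, &lim) != 0) {
        perror("setrlimit");
        return EXIT_FAILURE;
    }

    /* This request cannot fit within the limit, so malloc returns NULL. */
    void *p = malloc(256 * 1024 * 1024);
    printf("malloc returned %p\n", p);
    free(p);
    return 0;
}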

I'm not saying that you must handle out of memory conditions gracefully, although I would argue it is a good idea (especially if you are developing libraries). But you should at least check if malloc fails, as dereferencing NULL invokes undefined behavior in C, and may lead to surprising results from compiler optimizations.1,2
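Checking does not need to be complicated. Here is a minimal sketch of the common "exit on failure" wrapper (the name xmalloc and the error handling are just one choice — a library would typically propagate the failure to its caller instead):
#include <stdio.h>
#include <stdlib.h>

/* malloc wrapper that terminates the process on allocation failure
   instead of letting a NULL pointer propagate into the program. */
static void *xmalloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "out of memory (requested %zu bytes)\n", size);
        exit(EXIT_FAILURE);
    }
    return p;
}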


1 Such as this old Linux 2.6.30 kernel exploit.
2 I cannot see how the compiler may introduce problems by exploiting the undefined behavior resulting from not checking for malloc failure, but I'm sure GCC will find a way...

Tuesday, May 12, 2015

Optimizing ESSL

The cartoon understanding of compiler design is that compilers consist of three parts:
  • front end — handling everything that is language specific
  • middle end — language- and hardware-independent optimizations
  • back end — code generation, independent of the language
One point I was trying to make in my two previous posts is that the situation is more complex in reality; the backend may take advantage of the ESSL precision qualifiers during instruction selection/scheduling, and this affects what the optimizations are allowed to do. So you cannot use a language/hardware-independent middle end if you have a sufficiently strange architecture and want to take advantage of the latitude ESSL gives you.

There are many other considerations when writing a high-performance compiler for a specific market/language/hardware architecture that may be surprising if you have not worked in that area. I'll give some examples below that have surprised me over the years.

Performance, power, and performance measurement

Mobile devices are power constrained, so the clock frequency is dynamically managed to prevent the GPU from running too hot. Different operations consume different amounts of power, and it is not obvious that the fastest shader measured in "number of cycles" is the fastest in "running time", as a slower shader using less power-hungry instructions may be run at a higher clock frequency. So the cycle count may deceive you when you are optimizing shaders.

It is actually very hard to get meaningful performance data when evaluating optimizations (on all systems — not only GPUs), and just implementing an optimization and observing the difference in run time may not tell you if the optimization is beneficial or not. My favorite paper on this is "Producing Wrong Data Without Doing Anything Obviously Wrong!" by Mytkowicz et al., which shows that the performance of real-world applications depends surprisingly much on luck in things like alignment and cache effects. For example, changing the order of files when linking gives up to 15% performance variance for applications in the SPEC CPU2006 benchmark suite. And the result is different for different environments, so you may see a healthy 5% performance uplift in your environment, while the change is actually harmful and makes things slower in most other environments. I have seen many optimization results that I believe are due to this rather than any real improvement...

Compilation speed

High end mobile games may have hundreds of shaders, and shader compilation is done at application start up, so it is important that the compiler is fast. This means that the optimization strategy should be different compared to a desktop compiler, as you need to be more careful in the tradeoff between optimization run time and potential benefit, and not slow down the compiler by handling cases that are unlikely to happen in real world shaders.

Mobile CPUs have improved a lot the last couple of years, but they are still lagging the desktop when it comes to out-of-order execution etc. This makes the abstraction penalty more painful on mobile processors, and you may want to take that into account when designing an ESSL compiler.

Optimizations 

Desktop compilers are insanely complex, but most of that complexity deals with things that do not happen in shaders; ESSL does not have pointers, so data tracking and alias analysis are easy. Shaders do not work on large arrays, so you do not need to transform loops to get better memory access patterns. Vectorization is essentially software-based warping, so it does not help warp-based GPUs. Etc. etc.

And shaders are by necessity small — all mobile phones have high resolution screens, and you cannot spend that many cycles on each pixel if you want a decent frame rate.1 There is not much opportunity for optimization in small pieces of code, so the relevant optimizations are essentially what you had in an early '90s desktop compiler: inlining, simple loop unrolling, if-conversion, etc.

An important part of compiler development, which is usually glossed over in the compiler literature, is implementing peephole optimizations that map common code idioms to efficient instruction sequences. Application developers keep inventing strange code constructs, so this is a work package that is never finished. To take a random example from GCC: WebKit implements arithmetic right shift by 4 bits using the idiom
r = (v & ~15) / 16;
so GCC needed to add a rule to recognize this as an "arithmetic shift right" instruction. A big part of creating a good compiler is handling "all" such cases, and graphical shaders have different constructs compared to typical C/C++ code, so you need to invest lots of time looking at real-world shaders.


1 For example, 500MHz, 30fps, 1920x1080 translates to 8 cycles/pixel. Most GPUs have multiple cores (or whatever they are called — all GPU vendors have different terminology), so the cycle budget is larger for most devices. But still rather limited.

Sunday, April 26, 2015

Floating point, precision qualifiers, and optimization

ESSL permits optimizations that may change the value of floating point expressions (lowp and mediump precision changes, reassociation of addition/multiplication, etc.), which means that identical expressions may give different results in different shaders. This may cause problems with e.g. alignment of geometry in multi-pass algorithms, so output variables may be decorated with the invariant qualifier to force the compiler to be consistent in how it generates code for them. The compiler is still allowed to do value-changing optimizations for invariant expressions, but it needs to do them in the same way for all shaders. This may give us interesting problems if optimizations and code generation are done without knowledge of each other...

Example 1

As an example of the problems we may get with invariant, consider an application that is generating optimized SPIR-V using an offline ESSL compiler, and uses the IR with a Vulkan driver having a simple backend. The backend works on one basic block at a time, and is generating FMA (Fused Multiply-Add) instructions when multiplication is followed by addition. This is fine for invariant, even though FMA changes the precision, as the backend is consistent and always generates FMA when possible (i.e. identical expressions in different shaders will generate identical instructions).

The application has a shader
#version 310 es

in float a, b, c;
out invariant float result;

void main() {
    float tmp = a * b;
    if (c < 0.0) {
       result = tmp - 1.0;
    } else {
       result = tmp + 1.0;
    }
}
This is generated exactly as written if no optimization is done; first a multiplication, followed by a compare and branch, and we have two basic blocks doing one addition each. But the offline compiler optimizes this with if-conversion, so it generates SPIR-V as if main was written as
void main()
{
    float tmp = a * b;
    result = (c < 0.0) ? (tmp - 1.0) : (tmp + 1.0);
}
The optimization has eliminated the branches, and the backend will now see that it can use FMA instructions as everything is in the same basic block.

But the application has one additional shader where main looks like
void main() {
    float tmp = a * b;
    if (c < 0.0) {
       foo();
       result = tmp - 1.0;
    } else {
       result = tmp + 1.0;
    }
}
The optimization cannot transform the if-statement here, as the basic blocks are too complex. So this will not use FMA, and will therefore break the invariance guarantee. 

Example 2

It is not only invariant expressions that are problematic — you may get surprising results from normal code too when optimizations done offline and in the backend interact in interesting ways. For example, you can get different precision in different threads from "redundant computation elimination" optimizations. This happens for cases such as
mediump float tmp = a + b;
if (x == 0) {
  /* Code not using tmp */
  ...
} else if (x == 1) {
  /* Code using tmp */
  ...
} else {
  /* Code using tmp */
  ...
}
where tmp is calculated, but not used, for the case "x == 0". The optimization moves the tmp calculation into the two basic blocks where it is used
if (x == 0) {
  /* Code not using tmp */
  ...
} else if (x == 1) {
  mediump float tmp = a + b;
  /* Code using tmp */
  ...
} else {
  mediump float tmp = a + b;
  /* Code using tmp */
  ...
}
and the backend may now choose to use different precisions for the two mediump tmp calculations.

Offline optimization with SPIR-V

The examples above are of course silly — higher level optimizations should not be allowed to change control flow for invariant statements, and the "redundant computation elimination" does not make sense for warp-based architectures. But the first optimization would have been fine if used with a better backend that could combine instructions from different basic blocks. And not all GPUs are warp-based. That is, it is reasonable to do these kinds of optimizations, but they need to be done in the driver, where you have full knowledge of the backend and architecture.

My impression is that many developers believe that SPIR-V and Vulkan implies that the driver will just do simple code generation, and that all optimizations are done offline. But that will prevent some optimizations. It may work for a game engine generating IR for a known GPU, but I'm not sure that the GPU vendors will provide enough information on their architectures/backends that this will be viable either.

So my guess is that the drivers will continue to do all the current optimizations on SPIR-V too, and that offline optimizations will not matter...

Thursday, April 9, 2015

Precision qualifiers in SPIR-V

SPIR-V is a bit inconsistent in how it handles types for graphical shaders and compute kernels. Kernels use sized types, and there are explicit conversions when converting between sizes. Shaders use 32-bit types for everything, but there are precision decorations that indicate which size is really used, and conversions between sizes are done implicitly. I guess much of this is due to historical reasons in how ESSL defines its types, but I think it would be good to be more consistent in the IR.

ESSL 1 played fast and loose with types. For example, it has an integer type int, but the platform is allowed to implement it as floating point, so it is not necessarily true that "a+1 != a" for a sufficiently large a. ESSL 3 strengthened the type system, so for example high precision integers are now represented as 32-bit values in two's complement form. The rest of this post will use the ESSL 3 semantics.

ESSL does not care much about the size of variables; it has only one integer type "int" and one floating point type "float". But you need to specify which precision to use in calculations by adding precision qualifiers when you declare your variables, such as

highp float x;
Using highp means that the calculations must be done in 32-bit precision, mediump means at least 16-bit precision, and lowp means using at least 9 bits (yes, "nine". You cannot fit a lowp value in a byte). The compiler may use any size for the variables, as long as the precision is preserved.

So "mediump int" is similar to the int_least16_t type in C, but ESSL permits the compiler to use different precision for different instructions. It can for example use 16-bit precision for one mediump addition, and 32-bit for another, so it is not necessarily true that "a+b == a+b" for mediump integers a and b if the addition overflow 16 bits. The reason for having this semantics is to be able to use the hardware efficiently. Consider for example a processor having two parallel arithmetic units — one 16-bit and one 32-bit. If we have a shader where all instructions are mediump, then we could only reach 50% utilization by executing all instructions as 16-bit. But the backend can now promote half of them to 32-bit and thus be able to double the performance by using both arithmetic units.

SPIR-V represents this by always using a 32-bit type and decorating the variables and instructions with PrecisionLow, PrecisionMedium, or PrecisionHigh. The IR does not have any type conversions for the precision, as the actual type is the same and it is only the precision of the instruction that differs. But ESSL has requirements on conversions when changing precision in operations, similar to how size changes are handled in other languages:

When converting from a higher precision to a lower precision, if the value is representable by the implementation of the target precision, the conversion must also be exact. If the value is not representable, the behavior is dependent on the type:
  • For signed and unsigned integers, the value is truncated; bits in positions not present in the target precision are set to zero. (Positions start at zero and the least significant bit is considered to be position zero for this purpose.)
  • For floating point values, the value should either clamp to +INF or -INF, or to the maximum or minimum value that the implementation supports. While this behavior is implementation dependent, it should be consistent for a given implementation
It is of course fine to have the conversions implicit in the IR, but the conversions are explicit for the similar conversion fp32 to fp16 in kernels, so it is inconsistent. I would in general want the shader and kernel IR to be as similar as possible in order to avoid confusion when writing SPIR-V tools working on both types of IR, and I think it is possible to improve this with minor changes:
  • The highp precision qualifier means that the compiler must use 32-bit precision, i.e. a highp-qualified type is the same as the normal non-qualified 32-bit type. So PrecisionHigh does not tell the compiler anything; it just adds noise to the IR, and can be removed from SPIR-V.
  • Are GPUs really taking advantage of lowp for calculations? I can understand how lowp may be helpful for e.g. saving power in varying interpolation, and those cases are handled by having the PrecisionLow decoration on variables. But it seems unlikely to me that any GPU has added the extra hardware to do arithmetic in lowp precision, and I would assume all GPUs use 16-bit or higher for lowp arithmetic. If so, then PrecisionLow should not be a valid decoration for instructions.
  • The precision decorations are placed on instructions, but it seems better to me to have them on the type instead. If PrecisionLow and PrecisionHigh are removed, then PrecisionMedium is the only decoration left. But this can be treated as a normal 16-bit type from the optimizer's point of view, so we could instead permit both 32- and 16-bit types for graphical shaders, and specify in the execution model that it is allowed to promote 16-bit to 32-bit. Optimizations and type conversions can then be done in exactly the same way as for kernels, and the backend can promote the types as appropriate for the hardware.