Sunday, May 31, 2015

Running the GCC testsuite for epiphany-sim

I wanted to run the GCC testsuite on Adapteva’s Epiphany architecture, but I could not find much useful information on how to do it. This post documents what I eventually managed to get running.

The GCC "simtest howto" (which has examples and results from 2003; I'll send a patch to update it...) suggests using a "combined tree" where the source code from GCC, binutils, GDB, and newlib is merged. I'd like to avoid this, as I want to be able to test with different revisions of the components, and I do not trust that I will get reproducible results with the combined tree (for example, both binutils and GDB include libbfd, and I want to ensure that binutils is built with the correct version).

The instructions below build everything separately, using the latest released versions. It is assumed that DIST contains the path to the source code packages, and that PREFIX is the path where the resulting toolchain will be installed.

Building binutils

Binutils is built as
tar zxf ${DIST}/binutils-2.25.tar.gz
mkdir build_binutils && cd build_binutils
../binutils-2.25/configure --prefix=${PREFIX} --target=epiphany-elf
make -j4
make install
cd ..


Building GCC

GCC needs support from GMP, MPFR, etc. These can be handled using shared libraries, but I want to make sure I know which versions are used. The easiest way of handling this is to place the libraries' source code within the GCC source tree, which builds them as a part of GCC.
tar zxf ${DIST}/gcc-5.1.0.tar.gz
tar jxf ${DIST}/gmp-6.0.0a.tar.bz2
mv gmp-6.0.0 gcc-5.1.0/gmp
tar zxf ${DIST}/mpc-1.0.3.tar.gz
mv mpc-1.0.3 gcc-5.1.0/mpc
tar zxf ${DIST}/mpfr-3.1.2.tar.gz
mv mpfr-3.1.2 gcc-5.1.0/mpfr
tar jxf ${DIST}/isl-0.14.tar.bz2
mv isl-0.14 gcc-5.1.0/isl

We cannot build GCC before we have a full environment with newlib, but GCC is needed in order to build newlib. We therefore start by building a somewhat limited version of GCC that can be used to build the library.
mkdir build_gcc_tmp && cd build_gcc_tmp
../gcc-5.1.0/configure --prefix=${PREFIX} --target=epiphany-elf
make -j4 all-gcc
make install-gcc
cd ..


Building newlib

Newlib can now be built as
tar zxf ${DIST}/newlib-2.2.0.tar.gz
mkdir build_newlib && cd build_newlib
env PATH="${PREFIX}/bin:${PATH}" \
    ../newlib-2.2.0/configure --prefix=${PREFIX} --target=epiphany-elf
env PATH="${PREFIX}/bin:${PATH}" make -j4 all
env PATH="${PREFIX}/bin:${PATH}" make install
cd ..


Building GCC again

The "real" GCC is built as
mkdir build_gcc && cd build_gcc
../gcc-5.1.0/configure --prefix=${PREFIX} --target=epiphany-elf \
    --enable-languages="c,c++" --with-newlib
make -j4
make install
cd ..


Building the simulator

The testing is done by running the compiled code on a simulator that is built as a part of GDB, but the GNU GDB distribution does not have support for Epiphany. We therefore use the epiphany-gdb-7.8 branch from https://github.com/adapteva/epiphany-binutils-gdb. This repository contains both GDB and some random version of binutils, but we only need the simulator:
unzip ${DIST}/epiphany-binutils-gdb-epiphany-gdb-7.8.zip
mkdir build_sim && cd build_sim
../epiphany-binutils-gdb-epiphany-gdb-7.8/configure \
    --prefix=${PREFIX} --target=epiphany-elf
make -j4 all-sim
make install-sim
cd ..


Running the GCC testsuite

Dejagnu has configuration files for running tests on simulators for most hardware architectures, but not for Epiphany, so we need to create a configuration file epiphany-sim.exp. I'm using the following, which is a modified version of arm-sim.exp:
# Load the generic configuration for this board. This will define a basic
# set of routines used to communicate with the board.
load_generic_config "sim"

# No multilib flags needed by default.
process_multilib_options ""

# basic-sim.exp is a basic description for the standard Cygnus simulator.
load_base_board_description "basic-sim"

# The name of the directory in the build tree where the simulator lives.
setup_sim epiphany

# The compiler used to build for this board. This has *nothing* to do
# with what compiler is tested if we're testing gcc.
set_board_info compiler "[find_gcc]"

# The basic set of flags needed to build "hello world" for this
# board. This board uses libgloss and newlib.
set_board_info cflags "[libgloss_include_flags] [newlib_include_flags]"
set_board_info ldflags "[libgloss_link_flags] [newlib_link_flags]"

# This board doesn't use a linker script.
set_board_info ldscript ""

# No support for signals.
set_board_info gdb,nosignals 1

This file needs to be added to dejagnu's search path through a global configuration file. But you do not really need to add the path in the configuration file: dejagnu automatically searches the boards directory located in the same place as the configuration file. So it is enough to create an empty file ~/dejagnu/config.exp, and copy epiphany-sim.exp to ~/dejagnu/boards/epiphany-sim.exp.

The GCC testsuite can now be run as
cd build_gcc
env PATH="${PREFIX}/bin:${PATH}" DEJAGNU="${HOME}/dejagnu/config.exp" \
    make -j4 check-gcc RUNTESTFLAGS="--target_board=epiphany-sim"


Sunday, May 24, 2015

Human friendly SPIR-V textual representation

SPIR-V is a binary IL that is not meant to be written by humans. But there are many cases where it is desirable to write/modify IL, so I have defined a textual representation that I believe is more convenient to work with than the raw disassembly format used in the SPIR-V specification.

I have chosen to use an LLVM-like representation, as I'm used to that format. A typical instruction is written as
%58 = OpIAdd s32 %57, %32

Constants may be written directly as operands to the instructions. For example, if %32 is a constant
%32 = OpConstant s32 1

then the instruction %58 above can be written as
%58 = OpIAdd s32 %57, 1

In the same way, decorations may be attached directly to the instructions instead of having separate decoration instructions at the top of the file. For example
OpDecorate %56, PrecisionMedium
%56 = OpFMul <4 x f32> %55, %54

can be written as
%56 = OpFMul PrecisionMedium <4 x f32> %55, %54

Names can be used instead of the <id> number
%tmp = OpLoad <4 x f32> %21
%56 = OpFMul <4 x f32> %tmp, %54
This makes the assembler allocate a numerical <id> and add debug information with the name. In general, you do not need to specify things that the assembler can generate by itself, such as the constant and decoration instructions above, or the CFG; the assembler reorders the basic blocks when needed.

The SPIR-V format spreads some information over several instructions in different parts of the binary. This textual representation allows collecting those into one statement, so global variables may be written as
@gl_VertexID = Input s32 PrecisionHigh BuiltIn(5) NoStaticUse

which generates instructions
OpName %16, "gl_VertexID"
OpDecorate %16, PrecisionHigh
OpDecorate %16, BuiltIn, 5
OpDecorate %16, NoStaticUse
%15 = OpTypePointer Input, s32
%16 = OpVariable %15 Input

and function definitions can in a similar way be written as
define <4 x f32> @foo(<4 x f32> %a) {
...
}

which generates instructions
OpName %12, "foo"
OpName %11, "a"
%10 = OpTypeFunction <4 x f32>, <4 x f32>
%12 = OpFunction %8 0, %10
%11 = OpFunctionParameter <4 x f32>
...
OpFunctionEnd


As an example of what this looks like, I have disassembled a shader using my format.

The shader is the same as the one used in the raw disassembly example in the SPIR-V specification.

An assembler/disassembler implementing most of the above is available in my spirv-tools GitHub repository. The disassembler tries to take advantage of the syntactic sugar by default, which has the drawback that you do not have full control over <id> numbering etc., and you will in general get a different binary if you re-assemble the shader. But there is a command line option -r to disable this and output instructions exactly as in the binary, which is useful if you want to e.g. modify the code to trigger some special case in your compiler.

The implementation is rather rough right now, so it may not work on your favorite SPIR-V binary. But I'll spend some more time on this over the coming weeks (I plan to formalize and document the syntax, and fix the issues mentioned in the TODO file), so I expect to have a working assembler/disassembler well before the first Vulkan driver is available... :)

Sunday, May 17, 2015

Out of memory handling

I watched a video from CppCon 2014 where the speaker said during Q&A
[...] if you are on Linux, you know, malloc is never going to return NULL. It's always going to give you a chunk of memory, even if memory is full. It's going to say "I can get it from somewhere at some point", and if you actually runs out of memory, what happens is that the OS kills you.
I hear this a lot — there is no need to handle out of memory conditions as you'll never get NULL from malloc, and the OS will kill your process anyway. But it is wrong; there are at least two cases where malloc will return NULL on Linux:
• Per-process memory limits are configured, and the process is exceeding those.
• A 32-bit application running under a 64-bit kernel is trying to use more than about 4 gigabytes of memory.
So you need to deal with malloc returning NULL.

I'm not saying that you must handle out of memory conditions gracefully, although I would argue it is a good idea (especially if you are developing libraries). But you should at least check if malloc fails, as dereferencing NULL invokes undefined behavior in C, and may lead to surprising results from compiler optimizations.1,2

1 Such as this old Linux 2.6.30 kernel exploit.
2 I cannot see how the compiler may introduce problems by exploiting the undefined behavior resulting from not checking for malloc failure, but I'm sure GCC will find a way...

Tuesday, May 12, 2015

Optimizing ESSL

The cartoon understanding of compiler design is that compilers consist of three parts:
• front end — handling everything that is language specific
• middle end — language- and hardware-independent optimizations
• back end — code generation, independent of the language
One point I was trying to make in my two previous posts is that the situation is more complex in reality; the backend may take advantage of the ESSL precision qualifiers during instruction selection/scheduling, and this affects what the optimizations are allowed to do. So you cannot use a language/hardware-independent middle end if you have a sufficiently strange architecture and want to take advantage of the latitude ESSL gives you.

There are many other considerations when writing a high-performance compiler for some specific market/language/hardware architecture that may be surprising if you have not worked in that area. I'll give some examples below that have surprised me over the years.

Performance, power, and performance measurement

Mobile devices are power constrained, so the clock frequency is dynamically managed to prevent the GPU from running too hot. Different operations consume a different amount of power, and it is not obvious that the fastest shader measured in "number of cycles" is the fastest in "running time", as a slower shader using less power-hungry instructions may be run at a higher clock frequency. So the cycle count may deceive you when you are optimizing shaders.

It is actually very hard to get meaningful performance data when evaluating optimizations (on all systems, not only GPUs), and just implementing an optimization and observing the difference in run time may not tell you if the optimization is beneficial or not. My favorite paper on this is "Producing Wrong Data Without Doing Anything Obviously Wrong!" by Mytkowicz et al., which shows that the performance of real-world applications depends surprisingly much on luck in things like alignment and cache effects. For example, changing the order of files when linking gives up to 15% performance variance for applications in the SPEC CPU2006 benchmark suite. And the result is different for different environments, so you may see a healthy 5% performance uplift in your environment, while the change is actually harmful and makes it slower for most other environments. I have seen many optimization results that I believe are due to this rather than any real improvement...

Compilation speed

High end mobile games may have hundreds of shaders, and shader compilation is done at application start up, so it is important that the compiler is fast. This means that the optimization strategy should be different compared to a desktop compiler, as you need to be more careful in the tradeoff between optimization run time and potential benefit, and not slow down the compiler by handling cases that are unlikely to happen in real world shaders.

Mobile CPUs have improved a lot over the last couple of years, but they are still lagging behind the desktop when it comes to out-of-order execution etc. This makes the abstraction penalty more painful on mobile processors, and you may want to take that into account when designing an ESSL compiler.

Optimizations

Desktop compilers are insanely complex, but most of that complexity deals with things that do not happen in shaders: ESSL does not have pointers, so data tracking and aliasing analysis are easy. Shaders do not work on large arrays, so you do not need to transform loops to get better memory access patterns. Vectorization is essentially software-based warping, so it does not help warp-based GPUs. Etc. etc.

And shaders are by necessity small: all mobile phones have high resolution screens, and you cannot spend that many cycles on each pixel if you want a decent frame rate.1 There is not much opportunity for optimization in small pieces of code, so the relevant optimizations are essentially what you had in an early 90's desktop compiler: inlining, simple loop unrolling, if-conversion, etc.

An important part of compiler development that is usually glossed over in the compiler literature is implementing peephole optimizations that map common code idioms to efficient instruction sequences. Application developers keep inventing strange code constructs, so this is a work package that is never finished. To take a random example from GCC: WebKit implements arithmetic right shift by 4 bits using the idiom
r = (v & ~15) / 16;
so GCC needed to add a rule to recognize it as an "arithmetic shift right" instruction. A big part of creating a good compiler is to handle "all" such cases, and graphical shaders have different constructs compared to typical C/C++ code, so you need to invest lots of time looking at real world shaders.

1 For example, 500MHz, 30fps, 1920x1080 translates to 8 cycles/pixel. Most GPUs have multiple cores (or whatever they are called — all GPU vendors have different terminology), so the cycle budget is larger for most devices. But still rather limited.