tag:blogger.com,1999:blog-24208699919749357342024-03-14T01:16:16.934+01:00Krister Walfridsson’s old blogRandom thoughts on programming languages, compilers, operating systems, etc.Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.comBlogger78125tag:blogger.com,1999:blog-2420869991974935734.post-84793395242332909242021-10-20T00:27:00.001+02:002021-10-20T00:27:26.778+02:00This blog has moved<p style="text-align: left;">I have started blogging again!</p><p style="text-align: left;">But I am too annoyed at Blogger, so the new posts will be published in my new blog: <a href="https://kristerw.github.io/">https://kristerw.github.io/</a>.</p>Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com0tag:blogger.com,1999:blog-2420869991974935734.post-57390030416962502582020-02-02T10:42:00.000+01:002020-02-02T18:45:18.017+01:00Watching for software inefficiencies with Valgrind<div style="text-align: justify;">
Software inefficiencies often show up as wasteful memory operations, and many research tools can detect various cases of this. One such tool is described in the paper “<a href="https://www.researchgate.net/publication/323947597_Watching_for_Software_Inefficiencies_with_Witch">Watching for Software Inefficiencies with Witch</a>”, which uses the Intel CPU performance monitoring unit (PMU) and debug registers to find
<br />
<ul>
<li>dead stores – writing a value that is never read,</li>
<li>silent stores – writing the same value that was already present at the memory location,</li>
<li>silent loads – reading a value from memory that is unchanged since it was last read.</li>
</ul>
Most of the dead/silent memory operations are not problematic, and the analyzed program would most likely become slower if we tried to get rid of them by making the code more complicated. But there are cases, such as mostly dead stores of the form
<br />
<pre class="code-snippet">*p = expensive_computation();</pre>
where it may make sense to reorganize the code to only do the expensive computation when the value is needed.<br />
<br />
I thought it would be fun to play with this, but the “Witch” tools need a custom Linux kernel which I found too complicated to set up… But I have always wanted to learn how to write a Valgrind tool, and writing a Witch-like tool seemed like a good beginner project (although the main contribution in the “Witch” paper is doing this without performance overhead – that is definitely not the case when using Valgrind…).<br />
<br />
<h2>
The deadstores Valgrind tool</h2>
My Valgrind tool, “<a href="https://github.com/kristerw/deadstores">deadstores</a>”, gathers data from each instruction that is reading and writing memory:<br />
<ul>
<li>For stores, it records how many bytes were written, and how many of those bytes were read before the memory was overwritten (or made invalid by calling <code>free</code>, etc.). It also records how many times the instruction was executed, and how many of those wrote the same value as was there before.</li>
<li>For loads, it records how many times the instruction was executed, and how many of those read a value that had previously been read.</li>
</ul>
The reason dead stores are tracked on a byte granularity is that I want to be able to find obsolete, “write-only”, structure elements that are never read. For example, if we have a structure such as<br />
<pre class="code-snippet"><span style="color: #6600cc;">struct</span> {
<span style="color: #009900;">int</span> <span style="color: #996600;">a</span>, <span style="color: #996600;">b</span>, <span style="color: #996600;">c</span>, <span style="color: #996600;">d</span>;
} <span style="color: #996600;">foo</span>;
</pre>
and code that is initializing the structure as<br />
<pre class="code-snippet">memset(&foo, 0, <span style="color: #6600cc;">sizeof</span>(foo));
</pre>
then the compiler will usually optimize the <code>memset</code> to one SIMD instruction that is clearing all 128 bits. If the program now only reads three of the values, then we will see that 4 of the 16 bytes are dead – instruction granularity would not detect this as the stored value is (partly) used.<br />
<br />
Silent loads and stores are, however, tracked on instruction granularity. The reason is that byte granularity would report spurious silent loads/stores. For example, if <code>sum</code> is a 32-bit <code>int</code> variable containing the value <code>15</code>, and we add <code>10</code> to it<br />
<pre class="code-snippet">sum += 10;
</pre>
then this would be reported as 3 silent stores if we used byte granularity (as three of the bytes are <code>0</code> both before and after the operation).<br />
<br />
<h2>
How to use the tool</h2>
The <code>deadstores</code> tool is used in the same way as other Valgrind tools
<br />
<pre class="code-snippet">valgrind --tool=deadstores [valgrind-options] your-prog [your-prog-options]
</pre>
The program to analyze must be built with debug symbols (<span style="white-space: nowrap;"><code>-g</code></span> or better), and optimizations should be enabled in order to produce relevant data. You also need <span style="white-space: nowrap;"><code>-fno-omit-frame-pointer</code></span> if you are passing <span style="white-space: nowrap;"><code>--use-stack-trace=yes</code></span> (see below) to the tool.<br />
<br />
The result is written to a JSON file containing all data collected during the run. There is an example Python script <code>script/print_result.py</code> that extracts the top \(n\) instructions from each of “dead stores”, “silent stores”, and “silent loads”.<br />
<br />
The name of the output file is by default <code>deadstores.out.<pid></code> (where <code><pid></code> is the program’s process ID), but it can be changed by <span style="white-space: nowrap;"><code>--deadstores-out-file=<filename></code></span>. The <code>%p</code>, <code>%q</code>, and <code>%n</code> format specifiers can be used to embed the process ID, the contents of an environment variable, and/or a sequence number in the name, in the same way as for the core option <code><a href="http://valgrind.org/docs/manual/manual-core.html#opt.log-file">--log-file</a></code>.<br />
<br />
The tool by default tracks the instructions that read/write the memory, but this is often not too useful as it tends to report things like<br />
<pre class="code-snippet">0x054b0a25:
bytes_written: 11916560
bytes_read: 248126
bytes_dead: 11668434
nof_stores: 744785
nof_silent: 19326
at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
</pre>
The dead stores here come from the <code>memset</code> implementation in <code>libc</code>, and the values are the sums of all calls to <code>memset</code> in the program. This is not that helpful if we want to find dead stores we can eliminate, but the tool can instead track the loads/stores by the stack trace, which splits the reported data for an instruction according to where it was called from<br />
<pre class="code-snippet">0x054b0a25:
bytes_written: 1707072
bytes_read: 1120
bytes_dead: 1705952
nof_stores: 106692
nof_silent: 2500
at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
by 0x00b317aa: poison_pages() (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-page.c:2125)
by 0x00b3362c: ggc_collect() (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-page.c:2226)
by 0x00bb82d4: cgraph_node::finalize_function(tree_node*, bool) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cgraphunit.c:494)
by 0x00a60e43: expand_or_defer_fn(tree_node*) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cp/semantics.c:4486)
0x054b0a25:
bytes_written: 521808
bytes_read: 12
bytes_dead: 521796
nof_stores: 32613
nof_silent: 0
at 0x054b0a25: __memset_avx2 (in /build/glibc-Cl5G7W/glibc-2.23/string/../sysdeps/x86_64/multiarch/memset-avx2.S:85)
by 0x00d127fe: ggc_internal_cleared_alloc(unsigned long, void (*)(void*), unsigned long, unsigned long) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc-common.c:118)
by 0x012b796e: make_node(tree_code) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/ggc.h:143)
by 0x012b7c96: build_tree_list(tree_node*, tree_node*) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/tree.c:3184)
by 0x009e32d6: cp_parser_parameter_declaration_list(cp_parser*, int) (in /scratch/gcc-trunk/build/gcc/../../gcc/gcc/cp/parser.c:22525)
...
</pre>
This is enabled by <span style="white-space: nowrap;"><code>--use-stack-trace=yes</code></span> which in general makes the result much more useful (and the tool much slower…). The program to analyze must be compiled with frame pointer support (<span style="white-space: nowrap;"><code>-fno-omit-frame-pointer</code></span>) in order to get a reliable result.<br />
<br />
Most of the <a href="http://valgrind.org/docs/manual/manual-core.html">Valgrind core options</a> are also supported, such as <span style="white-space: nowrap;"><code>--trace-children=yes</code></span> which is useful when analyzing programs (such as <code>gcc</code>) that start subprocesses using the <code>exec</code> system call.<br />
<br />
<h2>
Limitations</h2>
Valgrind is not an optimal framework for writing this kind of tool. The reason is that it works by translating the assembly instructions to an internal, target-independent, representation before it is analyzed/instrumented by the tools, and this translation does not preserve all details of the instructions. This is not a problem for most instructions – for example, a simple addition instruction
<br />
<pre class="code-snippet">add %rcx,%rax
</pre>
is translated in a natural way to an IR instruction adding two values<br />
<pre class="code-snippet">t10 = Add64(t7,t4)
</pre>
But some instructions are translated in a way that causes problems. For example, the bit test instruction
<br />
<pre class="code-snippet">bt %rax,%r8
</pre>
is translated to a sequence that stores 8 bytes at an unused address on the stack and reads back one byte of the result
<br />
<pre class="code-snippet">t75 = GET:I64(48) ; <-- Get stack pointer
t74 = Sub64(t75,0x120:I64)
PUT(48) = t74
STle(t74) = t73 ; <-- Store 8 bytes
t18 = And64(t36,0x3F:I64)
t78 = Sar64(t18,0x3:I8)
t77 = Add64(t74,t78)
t80 = And64(t18,0x7:I64)
t112 = 64to8(t80)
t79 = t112
t15 = LDle:I8(t77) ; <-- Load 1 byte
t84 = GET:I64(168)
t113 = amd64g_calculate_rflags_all[mcx=0x9]{0x581732b0}(0x13:I64,t61,0x0:I64,t84):I64
t85 = t113
t114 = 8Uto64(t15)
t89 = t114
t88 = Shr64(t89,t79)
t87 = And64(t88,0x1:I64)
t90 = And64(t85,0x8D4:I64)
t86 = Or64(t90,t87)
PUT(144) = 0x0:I64
PUT(160) = 0x0:I64
PUT(152) = t86
PUT(168) = 0x0:I64
t91 = Add64(t74,0x120:I64)
PUT(48) = t91
</pre>
This makes the <code>deadstores</code> tool treat this as an 8-byte store where 7 of those bytes are dead… I have worked around this particular case (by ignoring stores to offset <code>0</code> from the stack pointer – this holds the pointer to the stack frame for normal functions, so these memory accesses are not relevant for the tool anyway), but there may be additional cases I have not found yet…<br />
<br />
One other limitation of the <code>deadstores</code> tool is that the tracking of silent loads only handles memory addresses that have been stored to, so it does not detect silent loads in read-only data segments. This could be fixed by tracking valid memory in the same way as done by the <code>memcheck</code> tool.<br />
<br />
<h2>
Implementation details</h2>
<h3>
Structures</h3>
The tool keeps track of statistics for each instruction reading/writing memory (number of bytes written, number of times executed, etc.) and the status of each byte written (to track if the stored value is later read or re-written with the same value).<br />
<br />
The statistics for each instruction is kept in the <code>node</code> structure<br />
<pre class="code-snippet"><span style="color: #6600cc;">struct</span> <span style="color: #009900;">node</span> {
<span style="color: #009900;">handle</span> <span style="color: #996600;">parent</span>;
<span style="color: #009900;">handle</span> <span style="color: #996600;">left</span>;
<span style="color: #009900;">handle</span> <span style="color: #996600;">right</span>;
<span style="color: #6600cc;">enum</span> <span style="color: #009900;">color</span> <span style="color: #996600;">color</span>;
<span style="color: #6600cc;">struct</span> <span style="color: #009900;">key</span> <span style="color: #996600;">key</span>;
<span style="color: #990000;">// Store</span>
<span style="color: #009900;">Long</span> <span style="color: #996600;">bytes_written</span>;
<span style="color: #009900;">Long</span> <span style="color: #996600;">bytes_read</span>;
<span style="color: #009900;">Long</span> <span style="color: #996600;">nof_stores</span>;
<span style="color: #009900;">Long</span> <span style="color: #996600;">nof_silent_stores</span>;
<span style="color: #990000;">// Load</span>
<span style="color: #009900;">Long</span> <span style="color: #996600;">nof_loads</span>;
<span style="color: #009900;">Long</span> <span style="color: #996600;">nof_silent_loads</span>;
};
</pre>
The <code>node</code> structures are stored in an array called <code>nodes</code>, and the nodes are organized as a red-black tree keyed by the address of the instruction (or rather, “generalized address” – it contains the top 5 addresses of the stack trace when <span style="white-space: nowrap;"><code>--use-stack-trace=yes</code></span> is used). The tree structure is managed by the <code>parent</code>, <code>left</code>, and <code>right</code> fields – these are conceptually pointers to <code>node</code> structures, but we are using handles (that are just the index into the <code>nodes</code> array) to make them valid even if we need to <code>realloc</code> the <code>nodes</code> array to increase its size.<br />
<br />
Each byte accessed by the program gets an <code>addr_info</code> structure for that address (these are allocated in blocks of <code>1<<20</code> addresses, and the blocks are stored in a hash table <code>addr_info_table</code>).
<br />
<pre class="code-snippet"><span style="color: #6600cc;">struct</span> <span style="color: #009900;">addr_info</span> {
<span style="color: #009900;">unsigned</span> <span style="color: #996600;">store_instr_h</span> : 31;
<span style="color: #009900;">unsigned</span> <span style="color: #996600;">is_written</span> : 1;
};
</pre>
The <code>addr_info</code> structure contains 1 bit telling if the byte is valid (that is, if it has been written to, and not been invalidated by the memory being <code>free</code>ed etc.), and a handle to the instruction that wrote to this byte. Programs do not use that many load/store instructions, so the handle is stored in 31 bits to make the structure fit in 32 bits. This makes a big difference when tracing memory-hungry programs...<br />
<br />
<h3>
Instrumentation</h3>
Valgrind works by transforming the binary code to its own internal representation (IR), and the tools can then add instrumentation to the IR before Valgrind JIT-compiles and executes the code. The <code>deadstores</code> tool adds its instrumentation in the function <code>ds_instrument</code>.<br />
<br />
The code in <code>ds_instrument</code> iterates through the IR to find the instructions that access memory – the most important are <code>Ist_WrTmp</code> (load) and <code>Ist_Store</code> (store), but there are a few others (such as the atomic compare-and-swap <code>Ist_CAS</code>). Each such instruction is instrumented using the <code>Ist_Dirty</code> IR instruction that is used to call a helper function. The <code>deadstores</code> tool has one such helper function for tracking each of “dead stores”, “silent stores”, and “loads” (each instruction that writes memory gets both the “dead stores” and “silent stores” helper functions, and the “silent stores” helper function must be called first).<br />
<br />
The helper functions used by the <code>deadstores</code> tool take three parameters – the address and size of the memory access, and the address of the instruction. The function that is tracking dead stores updates the counters for the instruction (that the instruction has been executed and has written <code>size</code> bytes) and the status of each written byte (that it has been written, and which instruction wrote the value).<br />
<pre class="code-snippet"><span style="color: #6600cc;">static</span> <span style="color: #009900;">void</span> <span style="color: #000099;">dead_store_tracker</span>(<span style="color: #009900;">Addr</span> <span style="color: #996600;">addr</span>, <span style="color: #009900;">SizeT</span> <span style="color: #996600;">size</span>, <span style="color: #009900;">Addr</span> <span style="color: #996600;">iaddr</span>)
{
<span style="color: #009900;">handle</span> <span style="color: #996600;">h</span> = iaddr2handle(iaddr);
nodes[h].nof_stores += 1;
nodes[h].bytes_written += size;
<span style="color: #6600cc;">for</span> (<span style="color: #009900;">SizeT</span> <span style="color: #996600;">i</span> = 0; i < size; i++) {
<span style="color: #6600cc;">struct</span> <span style="color: #009900;">addr_info</span> *<span style="color: #996600;">ai</span> = addr2addr_info(addr + i);
ai->is_written = True;
ai->store_instr_h = h;
}
}
</pre>
The load tracking checks each byte’s <code>store_instr_h</code> to see which instruction wrote this byte. If this is different from <code>0</code>, then the instruction corresponding to the handle has its <code>bytes_read</code> counter incremented, and the byte’s <code>store_instr_h</code> is set to <code>0</code> to mark it as read. If all of the bytes were written but have a zero <code>store_instr_h</code>, then the load is silent, and the load instruction’s counter is updated accordingly.<br />
<pre class="code-snippet"><span style="color: #6600cc;">static</span> <span style="color: #009900;">void</span> <span style="color: #000099;">load_tracker</span>(<span style="color: #009900;">Addr</span> <span style="color: #996600;">addr</span>, <span style="color: #009900;">SizeT</span> <span style="color: #996600;">size</span>, <span style="color: #009900;">Addr</span> <span style="color: #996600;">iaddr</span>)
{
<span style="color: #009900;">SizeT</span> <span style="color: #996600;">silent_load</span> = 0;
<span style="color: #6600cc;">for</span> (<span style="color: #009900;">SizeT</span> <span style="color: #996600;">i</span> = 0; i < size; i++) {
<span style="color: #6600cc;">struct</span> <span style="color: #009900;">addr_info</span> *<span style="color: #996600;">ai</span> = addr2addr_info(addr + i);
<span style="color: #6600cc;">if</span> (ai->store_instr_h != 0) {
nodes[ai->store_instr_h].bytes_read++;
ai->store_instr_h = 0;
} <span style="color: #6600cc;">else</span> <span style="color: #6600cc;">if</span> (ai->is_written) {
silent_load += 1;
}
}
<span style="color: #009900;">handle</span> <span style="color: #996600;">h</span> = iaddr2handle(iaddr);
nodes[h].nof_loads += 1;
<span style="color: #6600cc;">if</span> (silent_load == size) {
nodes[h].nof_silent_loads += 1;
}
}
</pre>
The tracking of silent stores needs a bit more work as it must compare the original memory contents with the newly written value. This cannot be done in the helper function, so it is done by inserting new IR that does the comparison, and the helper function is not called unless the values are identical. This functionality is partially supported by Valgrind – the <code>Ist_Dirty</code> instruction has an optional <code>guard</code> parameter taking an IR expression determining if the function is to be called, so the only thing needed is to generate the comparison and pass it as the guard. And this comparison is straightforward to emit – if <code>st</code> is an <code>IRStmt</code> pointer to a 32-bit <code>Ist_Store</code>, then the IR for the comparison can be emitted as<br />
<pre class="code-snippet"><span style="color: #009900;">IRExpr</span> *<span style="color: #996600;">data</span> = st->Ist.Store.data;
<span style="color: #009900;">IRType</span> <span style="color: #996600;">type</span> = typeOfIRExpr(tyenv, data);
<span style="color: #009900;">IRExpr</span> *<span style="color: #996600;">addr</span> = st->Ist.Store.addr;
tl_assert(type == Ity_I32);
<span style="color: #009900;">IRExpr</span> *<span style="color: #996600;">load</span> = emit_load(sb, type, addr);
<span style="color: #009900;">IRExpr</span> *<span style="color: #996600;">guard</span> = emit_binop(sb, Iop_CmpEQ32, data, load);
</pre>
where the <code>emit_</code>-functions are simple wrappers around the Valgrind API to get rid of some boilerplate code (<code>sb</code> is the IR superblock containing the instructions – this is one of the inputs to <code>ds_instrument</code>).<br />
<br />
The only thing left to do for the silent store tracker is to check that all addresses have been written to (in order to prevent reporting spurious silent stores for, for example, the first write to <code>malloc</code>ed memory that by luck happens to contain the same value that is stored by the instruction), and update the instruction’s <code>nof_silent_stores</code> counter if they have.<br />
<pre class="code-snippet"><span style="color: #6600cc;">static</span> <span style="color: #009900;">void</span> <span style="color: #000099;">silent_store_tracker</span>(<span style="color: #009900;">Addr</span> <span style="color: #996600;">addr</span>, <span style="color: #009900;">SizeT</span> <span style="color: #996600;">size</span>, <span style="color: #009900;">Addr</span> <span style="color: #996600;">iaddr</span>)
{
<span style="color: #009900;">Bool</span> <span style="color: #996600;">is_written</span> = True;
<span style="color: #009900;">for</span> (<span style="color: #009900;">SizeT</span> <span style="color: #996600;">i</span> = 0; i < size; i++) {
<span style="color: #6600cc;">struct</span> <span style="color: #009900;">addr_info</span> *<span style="color: #996600;">ai</span> = addr2addr_info(addr + i);
is_written = is_written && ai->is_written;
}
<span style="color: #6600cc;">if</span> (is_written) {
<span style="color: #009900;">handle</span> <span style="color: #996600;">h</span> = iaddr2handle(iaddr);
nodes[h].nof_silent_stores += 1;
}
}
</pre>
<br />
<h3>
Callbacks</h3>
It is possible to register callbacks that Valgrind calls for various things that cannot be done in the IR instrumentation (or that most tools must do, so it is better to have a common mechanism for this). Some examples of this are <code>track_new_mem_stack</code> and <code>track_die_mem_stack</code> that are called each time the stack pointer is decreased/increased, or the <code>needs_malloc_replacement</code> callbacks that are called when the program is calling <code>malloc</code>, <code>free</code>, etc.<br />
<br />
The <code>deadstores</code> tool uses these to mark memory as invalid (that is, clearing the <code>is_written</code> flag) to prevent counters being incorrectly updated by the silent load/store tracking when, for example, previously freed memory is returned from <code>malloc</code>.<br />
<br />
<h2>
TODO</h2>
There are many things that could be improved:<br />
<ul>
<li>The tool is slow. I have not done any profiling, but I would guess most of the time is spent in <code>iaddr2handle</code> and <code>addr2addr_info</code>. It is easy to improve how this is done.</li>
<li>Add a test suite.</li>
<li>Better tracking of valid memory (as described in the “limitations” section above).</li>
<li>Better tracking of variables and structures (to automatically find write-only variables and structure elements).</li>
<li>Finding other types of inefficiencies, such as <code>memcpy</code> where the program could have used the original data.</li>
<li>…</li>
</ul>
But I do not expect to do this anytime soon (if ever…) as my main goal was just to learn how to write a Valgrind tool…</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com1tag:blogger.com,1999:blog-2420869991974935734.post-43430554469423729162019-04-28T18:23:00.000+02:002019-05-07T20:07:49.415+02:00How LLVM optimizes power sums<div style="text-align: justify;">
LLVM optimizes power sums, such as
<br />
<pre class="code-snippet">int sum(int count)
{
int result = 0;
for (int j = 0; j < count; ++j)
result += j*j;
return result;
}
</pre>
to code that calculates the result without a loop (<a href="https://godbolt.org/z/ZJhD05" target="_blank">godbolt</a>)
<br />
<pre class="code-snippet">sum(int):
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
lea eax, [rdi - 3]
imul rax, rcx
shr rax
imul eax, eax, 1431655766
add eax, edi
shr rcx
lea ecx, [rcx + 2*rcx]
lea eax, [rax + rcx]
add eax, -1
ret
.LBB0_1:
xor eax, eax
ret</pre>
It handles more complex cases too (<a href="https://godbolt.org/z/mWknyx" target="_blank">godbolt</a>) – that is, the optimization is not just a simple pattern matcher. This post will show how the optimization is done.
<br />
<br /></div>
<h2 style="text-align: justify;">
Loop analysis – scalar evolution</h2>
<div style="text-align: justify;">
There are many cases where compilers need to track how values are updated within a loop. For example, the loop vectorizer needs to check that the pointers are moved to the adjacent element in the next iteration, and check that no other pointer indexing may alias the range we are vectorizing.<br />
<br /></div>
<div style="text-align: justify;">
Both GCC and LLVM do this in the same way in their scalar evolution passes, where each variable at iteration \(i\) (we start enumerating iterations from \(0\)) is represented as a function \(f_0(i)\) defined as a linear recurrence of the form<br />
\[f_j(i) =<br />
\begin{cases}<br />
\phi_j & \text{if $i = 0$} \\<br />
f_j(i-1) \odot_{j+1} f_{j+1}(i-1) & \text{if $i > 0$}<br />
\end{cases}\] where \(\odot\in\{+,*\}\).<br />
<br />
<h3>
Example 1</h3>
The simplest case is a loop such as<br />
<pre class="code-snippet">void foo(int m, int *p)
{
for (int j = 0; j < m; j++)
*p++ = j;
}
</pre>
The loop writes \(0\) to <code>*p++</code> in the first iteration, \(1\) in the second, etc. So we can express the value written at iteration \(i\) as\[f(i) =<br />
\begin{cases}<br />
0 & \text{if $i = 0$} \\<br />
f(i-1) + 1 & \text{if $i > 0$}<br />
\end{cases}\]<br />
<h3>
Example 2</h3>
Polynomials in the iteration variable can also be expressed in this form.<br />
<pre class="code-snippet">void foo(int m, int k, int *p)
{
for (int j = 0; j < m; j++)
*p++ = j*j*j - 2*j*j + k*j + 7;
}
</pre>
We will see below how to build the functions, but the result for the value stored in this loop is \[\begin{align}f_2(i) & =<br />
\begin{cases}<br />
2\phantom{f_0(i-1) + f_1(i-1)} & \text{if $i = 0$} \\<br />
f_2(i-1) + 6 & \text{if $i > 0$}<br />
\end{cases}\\<br />
f_1(i) & =<br />
\begin{cases}<br />
k-1 & \text{if $i = 0$} \\<br />
f_1(i-1) + f_2(i-1)\phantom{2} & \text{if $i > 0$}<br />
\end{cases}\\<br />
f(i) = f_0(i) & =<br />
\begin{cases}<br />
7 & \text{if $i = 0$} \\<br />
f_0(i-1) + f_1(i-1)\phantom{2} & \text{if $i > 0$}<br />
\end{cases}\end{align}\] One optimization we can see directly from these functions is that the value can be calculated by just three additions within the loop<br />
<pre class="code-snippet">void foo(int m, int k, int *p)
{
int t0 = 7;
int t1 = k-1;
int t2 = 2;
for (int j = 0; j < m; j++) {
*p++ = t0;
t0 = t0 + t1;
t1 = t1 + t2;
t2 = t2 + 6;
}
}
</pre>
which is a useful optimization for architectures where multiplication is expensive. This kind of code is, however, uncommon, so most compilers do not do this optimization – but they usually do it for simpler cases, such as<br />
<pre class="code-snippet">void foo(int m, int k, int *p)
{
for (int j = 0; j < m; j++)
*p++ = k*j + 7;
}
</pre>
as constructs of the form <code>k*j+7</code> are common in address calculations.<br />
<br />
<div style="text-align: justify;">
<h2>
Chains of recurrences</h2>
It is cumbersome to write the recursive functions all the time, so the functions are usually written in the form \(\{\phi_j, \odot_{j+1}, f_{j+1}\}\). For example \[\begin{align}f_2(i) & =<br />
\begin{cases}<br />
2\phantom{f_0(i-1) + f_1(i-1)} & \text{if $i = 0$} \\<br />
f_2(i-1) + 6 & \text{if $i > 0$}<br />
\end{cases} \phantom{xx}\text{is written as $\{2,+,6\}$}\\<br />
f_1(i) & =<br />
\begin{cases}<br />
k-1 & \text{if $i = 0$} \\<br />
f_1(i-1) + f_2(i-1)\phantom{2} & \text{if $i > 0$}<br />
\end{cases} \phantom{xx}\text{is written as $\{k-1,+,f_2\}$}\\<br />
f(i) = f_0(i) & =<br />
\begin{cases}<br />
7 & \text{if $i = 0$} \\<br />
f_0(i-1) + f_1(i-1)\phantom{2} & \text{if $i > 0$}<br />
\end{cases} \phantom{xx}\text{is written as $\{7,+,f_1\}$}\end{align}\] These can be chained, so \(f(i)\) can be written as a <i>chain of recurrences</i> (CR) \(\{7,+,\{k-1,+,\{2,+,6\}\}\}\). The inner curly braces are redundant, so the CR is usually written as a tuple \(\{7,+,k-1,+,2,+,6\}\).<br />
<br />
<h2>
Building the chains of recurrences</h2>
The chains of recurrences are built by iterating over the code and calculating the CR for the result of each operation (or marking it as unknown if we cannot handle the operation), using simplification rules such as\[\begin{align}c * \{\phi_0, +, \phi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{c * \phi_0, +, c * \phi_1\} \\<br />
\{\phi_0, +, \phi_1\} + \{\psi_0, +, \psi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0 + \psi_0, +, \phi_1 + \psi_1\} \\<br />
\{\phi_0, +, \phi_1\}* \{\psi_0, +, \psi_1\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0 * \psi_0, +, \psi_1 * \{\phi_0, +, \phi_1\} + \phi_1 * \{\psi_0, +, \psi_1\} + \phi_1*\psi_1\} \\<br />
\{\phi_0, +, \phi_1,+,0\} & \phantom{xx} \Rightarrow \phantom{xx} \{\phi_0, +, \phi_1\}\end{align}\] So for the loop from the <code>sum</code> function<br />
<pre class="code-snippet">for (int j = 0; j < count; ++j)
result += j*j;
</pre>
we start with <code>j</code> which we know has the CR \(\{0,+,1\}\) per Example 1. This is then used as <code>j*j</code> when calculating <code>result</code>, so we calculate the CR for <code>j*j</code> using the simplification rules as \[\begin{align}j*j& = \{0,+,1\} * \{0,+,1\} \\<br />
& = \{0 * 0, +, 1 * \{0, +,1\} + 1 * \{0, +, 1\} + 1*1\} \\<br />
& = \{0, +, 1,+,2\}\end{align}\]
Similar calculations for <code>result</code> give us the CR \(\{0,+,0,+,1,+,2\}\) for the value at the beginning of the iteration, and \(\{0,+,1,+,3,+,2\}\) after adding <code>j*j</code>.<br />
<br /></div>
<h2 style="text-align: justify;">
Doing the optimization</h2>
<div style="text-align: justify;">
The optimization is done during induction variable simplification, and LLVM has transformed the function to a form more convenient for analysis and optimization<br />
<pre class="code-snippet">int sum(int count)
{
int result = 0;
if (count > 0) {
int j = 0;
do {
result = result + j*j;
++j;
} while (j < count);
}
return result;
}
</pre>
or as it looks in the LLVM IR<br />
<pre class="code-snippet">define i32 @sum(i32) {
%2 = icmp sgt i32 %0, 0
br i1 %2, label %3, label %6
; <label>:3:
br label %8
; <label>:4:
%5 = phi i32 [ %12, %8 ]
br label %6
; <label>:6:
%7 = phi i32 [ 0, %1 ], [ %5, %4 ]
ret i32 %7
; <label>:8:
%9 = phi i32 [ %13, %8 ], [ 0, %3 ] <span style="color: #aaaaaa;">; {0,+,1}</span>
%10 = phi i32 [ %12, %8 ], [ 0, %3 ] <span style="color: #aaaaaa;">; {0,+,0,+,1,+,2}</span>
%11 = mul nsw i32 %9, %9 <span style="color: #aaaaaa;">; {0,+,1,+,2}</span>
%12 = add nuw nsw i32 %11, %10 <span style="color: #aaaaaa;">; {0,+,1,+,3,+,2}</span>
%13 = add nuw nsw i32 %9, 1 <span style="color: #aaaaaa;">; {1,+,1}</span>
%14 = icmp slt i32 %13, %0
br i1 %14, label %8, label %4
}
</pre>
The compiler can see that the function returns <code>0</code> if <code>count <= 0</code>, otherwise it returns the value of <code>result</code> at loop iteration <code>count-1</code>.<br />
<br />
One nice property of the chains of recurrences is that it is easy to calculate the value at a specific iteration – if we have a CR \(\{\phi_0,+,\phi_1,+,\ldots,+,\phi_n\}\), then the value at iteration \(i\) can be calculated as \[\begin{align}f(i) & = \sum_{j=0}^{n}\phi_j{i \choose j} \\ & = \phi_0 + \phi_1i + \phi_2{i(i-1)\over 2!} + \ldots + \phi_n{i(i-1)\cdots(i-n+1)\over n!}\end{align}\] Inserting the values for the CR \(\{0,+,1,+,3,+,2\}\) describing <code>result</code> gives us \[f(i) = i + {3i(i-1)\over 2} + {i(i-1)(i-2) \over 3}\] The compiler now only needs to insert code that calculates this with \(i =\) <code>count-1</code>, after the loop<br />
<pre class="code-snippet">result = count-1 + 3*(count-1)*(count-2)/2 + (count-1)*(count-2)*(count-3)/3;
</pre>
but it needs to take some care to do the calculations with enough precision (the temporary values may not fit in 32-bit integers). <a href="https://kristerw.blogspot.com/2017/05/integer-division-is-slow.html">Integer division is slow</a>, so the compiler also does tricks to replace the divisions with multiplication and shift instructions. The result is the LLVM IR<br />
<pre class="code-snippet"> %4 = add i32 %0, -1
%5 = zext i32 %4 to i33
%6 = add i32 %0, -2
%7 = zext i32 %6 to i33
%8 = mul i33 %5, %7
%9 = add i32 %0, -3
%10 = zext i32 %9 to i33
%11 = mul i33 %8, %10
%12 = lshr i33 %11, 1
%13 = trunc i33 %12 to i32
%14 = mul i32 %13, 1431655766
%15 = add i32 %14, %0
%16 = lshr i33 %8, 1
%17 = trunc i33 %16 to i32
%18 = mul i32 %17, 3
%19 = add i32 %15, %18
%20 = add i32 %19, -1
</pre>
Inserting this makes the loop dead, so it is later removed by dead code elimination, and we eventually end up with the code<br />
<pre class="code-snippet">sum(int):
        test    edi, edi
        jle     .LBB0_1
        lea     eax, [rdi - 1]
        lea     ecx, [rdi - 2]
        imul    rcx, rax
        lea     eax, [rdi - 3]
        imul    rax, rcx
        shr     rax
        imul    eax, eax, 1431655766
        add     eax, edi
        shr     rcx
        lea     ecx, [rcx + 2*rcx]
        lea     eax, [rax + rcx]
        add     eax, -1
        ret
.LBB0_1:
        xor     eax, eax
        ret
</pre>
<br />
<h2 style="text-align: justify;">
Performance</h2>
This optimization is not always profitable. For example, <br />
<pre class="code-snippet">int sum(int count)
{
  int result = 0;
  for (int j = 0; j < count; ++j)
    result += j*j*j*j*j*j;
  return result;
}</pre>
is calculated by three 32-bit multiplications and one addition within the loop, while the optimized version needs six 64-bit multiplications, five 32-bit multiplications, and a slew of other instructions (<a href="https://godbolt.org/z/tV9Rii" target="_blank">godbolt</a>), so the optimized version is slower for small values of <code>count</code>.
I benchmarked on my PC, and <code>count</code> must be larger than <code>5</code> for the optimized version to be faster than the loop. Smaller CPUs with, for example, more expensive 64-bit multiplication, will need a higher <code>count</code> for the optimization to help. And CPUs not having instructions for 64-bit multiplication (<a href="https://godbolt.org/z/KNBl0Z" target="_blank">godbolt</a>) will need a <i>much</i> higher <code>count</code>.<br />
<br />
One problem with this optimization is that there is no good way for developers to tell the compiler to keep the loop when they know that the values occurring in practice are small enough for the loop to be the fastest option. GCC does, therefore, not replace the final value of a loop if the expression is expensive<br />
<pre class="code-snippet">/* Do not emit expensive expressions.  The rationale is that
   when someone writes a code like

   while (n > 45) n -= 45;

   he probably knows that n is not large, and does not want it
   to be turned into n %= 45.  */
|| expression_expensive_p (def))
</pre>
So GCC not doing this optimization is a feature – not a bug.</div>
<div style="text-align: justify;">
<br />
<h2>
Further reading</h2>
Chains of recurrences:
<br />
<ul>
<li>Olaf Bachmann, Paul S. Wang, Eugene V. Zima. “<a href="https://bohr.wlu.ca/ezima/papers/ISSAC94_p242-bachmann.pdf">Chains of recurrences – a method to expedite the evaluation of closed-form functions</a>”</li>
<li>Eugene V. Zima. “<a href="https://bohr.wlu.ca/ezima/papers/ISSAC01_p345-zima.pdf">On computational properties of chains of recurrences</a>”</li>
</ul>
Loop optimizations using chains of recurrences:
<br />
<ul>
<li>Robert A. van Engelen. “<a href="https://pdfs.semanticscholar.org/d247/e00ddea346258916c19b1480048ce0fd20be.pdf">Symbolic Evaluation of Chains of Recurrences for Loop Optimization</a>”</li>
<li>Robert A. van Engelen. “<a href="https://www.cs.fsu.edu/~engelen/cc01.pdf">Efficient Symbolic Analysis for Optimizing Compilers</a>”</li>
</ul>
Optimizing division using multiplication and shift instructions:
<br />
<ul>
<li>Torbjörn Granlund, Peter L. Montgomery. “<a href="https://gmplib.org/~tege/divcnst-pldi94.pdf">Division by Invariant Integers using Multiplication</a>”</li>
</ul>
<div>
<br /></div>
<div>
<i>Updated: The original post incorrectly called the power sums “geometric sums”.</i></div>
</div>
</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com0tag:blogger.com,1999:blog-2420869991974935734.post-469701020940533192019-01-26T20:09:00.000+01:002019-01-26T20:09:55.075+01:00Building GCC 1.27<div style="text-align: justify;">
GCC 1.27 was released in 1988 and is the first version of GCC supporting the x86 CPU. I thought it would be fun to make it work on my desktop computer.</div>
<div style="text-align: justify;">
<br />
Mikhail Maltsev wrote a great blog post about this a while back “<a href="https://miyuki.github.io/2017/10/04/gcc-archaeology-1.html">Building and using a 29-year-old compiler on a modern system</a>”. I used Mikhail’s work as a starting point, but I built on a 64-bit Ubuntu system so I needed to update paths and <code>as</code>/<code>ld</code> options for running on a 64-bit OS, and I had much bigger problems with the ancient GCC not understanding the system headers. I also enabled the DBX debug format instead of the UNIX/32V SDB format, which GDB does not understand. But I did not need to make that many changes to Mikhail’s patch.<br />
<br />
<div>
It is amazing how well things work – the modern assembler, linker, and debugger handle the code generated by GCC 1.27 without any problems. And command options such as <code>-O</code>, <code>-E</code>, <code>-S</code>, <code>-c</code>, <code>-g</code>, <code>-W</code>, <span style="white-space: nowrap;"><code>-pedantic</code></span>, and <span style="white-space: nowrap;"><code>-fomit-frame-pointer</code></span> do what you expect from using a modern GCC. All the options are documented in the manual page – you can format the text by passing the <code>gcc.1</code> file to <code>man</code> as<br />
<pre class="code-snippet">man -l gcc.1
</pre>
</div>
<div>
<br /></div>
<h3>
How to build GCC 1.27 on Ubuntu</h3>
<div>
I have built the compiler on 64-bit Ubuntu Desktop 16.04 as described below.</div>
<div>
<br /></div>
<h4>
Prerequisites</h4>
<div>
GCC 1.27 is a 32-bit program, so we need to install 32-bit compiler and runtime support<br />
<pre class="code-snippet">sudo apt install gcc-multilib
</pre>
</div>
<div>
<br /></div>
<h4>
Downloading and preparing the source code</h4>
<div>
Download the source code and the patch, and apply the patch as<br />
<pre class="code-snippet">wget https://gcc.gnu.org/pub/gcc/old-releases/gcc-1/gcc-1.27.tar.bz2
wget https://gist.github.com/kristerw/b854b6d285e678452a44a6bcbf7ef86f/raw/gcc-1.27.patch
tar xf gcc-1.27.tar.bz2
cd gcc-1.27
patch -p1 < ../gcc-1.27.patch
</pre>
<br /></div>
<h4>
Configuring the source code</h4>
<div>
The compiler is configured by setting up symbolic links to the correct configuration files<br />
<pre class="code-snippet">ln -s config-i386v.h config.h
ln -s tm-i386v.h tm.h
ln -s i386.md md
ln -s output-i386.c aux-output.c
</pre>
You may want to change where the compiler is installed by updating <code>bindir</code> and <code>libdir</code> in the <code>Makefile</code>. I set them to<br />
<pre class="code-snippet">bindir = /home/kristerw/compilers/gcc-1.27/bin
libdir = /home/kristerw/compilers/gcc-1.27/lib
</pre>
<br /></div>
<h4>
Build and install</h4>
<div>
The compiler is built in two (or three) stages. We start by building it using the system compiler<br />
<pre class="code-snippet">make
</pre>
We then build it again using the newly built compiler<br />
<pre class="code-snippet">make stage1
make CC=stage1/gcc CFLAGS="-O -Bstage1/ -Iinclude"
</pre>
As a third, optional, step, we build it again using the second compiler and check that the resulting binaries are identical to the second compiler (if not, then the compiler has miscompiled itself).<br />
<pre class="code-snippet">make stage2
make CC=stage2/gcc CFLAGS="-O -Bstage2/ -Iinclude"
diff cpp stage2/cpp
diff gcc stage2/gcc
diff cc1 stage2/cc1
</pre>
We are now ready to install the compiler</div>
<pre class="code-snippet">make install
</pre>
</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com1tag:blogger.com,1999:blog-2420869991974935734.post-69226632742573890062018-12-29T19:47:00.000+01:002018-12-29T19:47:23.541+01:00Commodore 64 programming<div style="text-align: justify;">
I have done some Commodore 64 programming over the holidays. The C64 is <i>old</i>, but I think there are things that can be learned from dealing with all the hardware limitations (although there may be better ways of getting this experience with more recent platforms...)<br />
<br /></div>
<h3 style="text-align: justify;">
C64 Hardware</h3>
<div style="text-align: justify;">
The <a href="https://www.masswerk.at/6502/6502_instruction_set.html">6502 instruction set</a> consists of 56 very simple instructions, and the CPU has been completely reverse engineered so you can follow how it does its work – <a href="http://visual6502.org/">Visual 6502</a> even lets you visualize a <a href="http://visual6502.org/JSSim/expert.html">transistor-level simulation of the CPU</a> in the browser! The details of the CPU are described in the talk “Reverse Engineering the MOS 6502 CPU”:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/fWqBmmPQP40/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/fWqBmmPQP40?feature=player_embedded" width="320"></iframe></div>
<br />
The rest of the C64 hardware is also well described by a talk – “The Ultimate Commodore 64 Talk”:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/ZsRRCnque2E/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/ZsRRCnque2E?feature=player_embedded" width="320"></iframe></div>
<br />
There is lots of information about the C64 available on the net, but I have not found any single convenient place that collects all the relevant information... But everything I needed (including examples/tutorials) was available at <a href="https://www.c64-wiki.com/">C64 Wiki</a> and <a href="http://codebase64.org/">Codebase 64</a>. The details of the Video Interface Chip are documented in “<a href="http://www.zimmers.net/cbmpics/cbm/c64/vic-ii.txt">The MOS 6567/6569 video controller (VIC-II) and its application in the Commodore 64</a>”, and Andre Weissflog’s <a href="https://github.com/floooh/chips">chip emulators</a> contain very readable code if you want clarifications on what the various hardware registers do.<br />
<br /></div>
<h3 style="text-align: justify;">
Getting started – raster bars</h3>
<div style="text-align: justify;">
My first test was to implement two raster bars and some sprites moving around, just to verify that I understood the basics.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-IQYf8w2U9OU/XCO_l05EkHI/AAAAAAAAEPU/AXgt6sQMPSQ5ImgRA1sOw-ISzDImzzlQwCLcBGAs/s1600/c64.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="542" data-original-width="720" height="240" src="https://1.bp.blogspot.com/-IQYf8w2U9OU/XCO_l05EkHI/AAAAAAAAEPU/AXgt6sQMPSQ5ImgRA1sOw-ISzDImzzlQwCLcBGAs/s320/c64.png" width="320" /></a></div>
<br />
Much of C64 programming centers on working around hardware limitations by timing the code carefully. For example, the raster bars are drawn in the border of the screen, but the border can only have one color. It is, however, possible to modify the border color while the hardware is drawing the screen (the C64 hardware is designed for old CRT monitors that draw the picture one line at a time using an electron beam), so we can draw the raster bars by changing the border color exactly when a new line starts!<br />
<br />
I had thought the raster bars would be trivial (just enable an interrupt on a line, and change the colors), but the C64 is not fast enough for this – it takes 9-16 cycles to enter the IRQ, so we are already a bit into the screen when we can change the color. And there are only 63 CPU cycles for each line, so we do not have time to set up a new interrupt for the next line anyway. We, therefore, need to first synchronize our code with the start of a line, and then write the code (using NOPs etc. to pad it out) so that we change the colors exactly every 63 cycles.<br />
<br />
But there are more complications – some lines have fewer than 63 cycles available to do the work. The reason is that the VIC-II chip steals cycles from the CPU on lines where it must fetch extra data. There are two cases:<br />
<ul>
<li>The first raster line of each text line must fetch the characters, which steals one cycle per character. These lines are usually called “bad lines”.</li>
<li>Each sprite that is enabled on the raster line steals 2 cycles.</li>
</ul>
There is, in addition, an overhead to this cycle stealing mechanism that may waste up to 3 cycles per raster line (depending on what the CPU is doing when the cycle stealing starts).<br />
<br />
I was lazy and ensured that my raster bars were not on a bad line and that there were no sprites on the same raster line, but careful programming can handle these cases too. The talk “Behind the scenes of a C64 demo” mentions some insane techniques, such as treating the cycle counter as an address and jumping to it (this jumps to different addresses depending on how many cycles have executed, and careful layout of the code can make this compensate for differences in execution time).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/po9IY5Kf0Mo/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/po9IY5Kf0Mo?feature=player_embedded" width="320"></iframe></div>
<br />
<h3>
Source code</h3>
The source code for my test program is available below, and you can build it using the <a href="https://sourceforge.net/projects/acme-crossass/">ACME Cross-Assembler</a> as<br />
<pre class="code-snippet">acme -o test.prg test.asm
</pre>
The resulting program <code>test.prg</code> can be run in the <a href="http://vice-emu.sourceforge.net/">VICE</a> emulator by loading it using the “Smart attach Disk/Tape” option.<br />
<br />
<script src="https://gist.github.com/kristerw/59de3f1005a8d20e4fb56a9e20369491.js"></script>
</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com0tag:blogger.com,1999:blog-2420869991974935734.post-991559990331674862018-10-21T18:22:00.000+02:002019-11-02T15:10:53.202+01:00Optimization and performance measurement<div style="text-align: justify;">
The performance of a program depends surprisingly much on things that “should not” matter (such as the <a href="https://www.seas.upenn.edu/~cis501/papers/producing-wrong-data.pdf">order files are passed to the linker</a>), and modifying one part of the program may change the performance of other parts.<br />
<br />
<h3>
An example</h3>
We will look at the <a href="http://www.libquantum.de/">libquantum</a> library as an example. Configuring and compiling this as<br />
<pre class="code-snippet">./configure
make demos
</pre>
produces two demo applications. Running the <code>shor</code> application as<br />
<pre class="code-snippet">shor 456 427
</pre>
takes 11.0 seconds<sup>1</sup> on my PC (Ubuntu 16.04, Intel Broadwell CPU).<br />
<br />
The performance of the <code>shor</code> application is relatively sensitive to changes, and we can see surprising performance differences when modifying the code. For example, adding this function<br />
<pre class="code-snippet"><span style="color: #00aaaa;">void</span> <span style="color: #00aa00;">foo</span>(<span style="color: #00aaaa;">int</span> *a, <span style="color: #00aaaa;">int</span> *b, <span style="color: #00aaaa;">int</span> *c, <span style="color: #00aaaa;">int</span> n)
{
  <span style="color: #0000aa;">for</span> (<span style="color: #00aaaa;">int</span> i = <span style="color: #009999;">0</span>; i < n; i++)
    *a++ = *b++ + *c++;
}
</pre>
at the top of <code>measure.c</code> makes the program need 11.8 seconds to run – it has become 7% slower by just adding an unused function!<br />
<br />
<h3>
Why we get the performance difference</h3>
The only effect of adding the unused function is that some other functions are moved to different addresses<sup>2</sup>, so it is natural to attribute the performance difference to “cache effects”. But there are several other hardware bottlenecks that can cause this kind of performance variability – this particular case is due to branch prediction.<br />
<br />
Branch predictors keep track of the histories of branches by building various tables in hardware (see Dan Luu’s <a href="https://danluu.com/branch-prediction/">great blog post</a>). But the branch prediction hardware looks only at a few bits of the addresses, so several branches may map to the same slot, with the result that the branch predictor cannot differentiate between them. Changing the size of one code segment makes other parts of the program move to different addresses, and some branches may be unlucky in that they now alias with branches behaving in a different way, which causes the performance to regress due to excessive misprediction.<br />
<br />
<h3>
Impact of this when optimizing</h3>
All books and conference talks about optimization tell you to always measure your change to verify that it actually improves the performance before committing. This is good advice, but just measuring the running time is not enough! For example, if our change to libquantum was a real improvement that made the code 5% faster, then we would still have seen a 2% regression due to the (unrelated) branch prediction issues. But this regression is just us being unlucky with code placement, and further changes may make us lucky again, so it would probably have been better long-term to do the change. Similarly, I have seen too many bad changes made because “we measured it, and this made the program faster (but we do not understand why...)”.<br />
<br />
So it is often useful to dig deeper and understand <i>why</i> we get the performance result so that we can attack the root cause and not rely on being lucky – for the libquantum example we could ensure the branches in the sensitive part of the code are located near each other in memory (for example, by forcing the code to be inlined) or try to eliminate some branches by algorithmic changes.<br />
<br />
<h3>
Further reading</h3>
<ul>
<li>“<a href="http://sape.inf.usi.ch/publications/asplos09">Producing Wrong Data Without Doing Anything Obviously Wrong!</a>” has a more detailed discussion about these issues.</li>
<li>“<a href="https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf">Intel 64 and IA-32 Architectures Optimization Reference Manual</a>” describes many of the hardware bottlenecks for Intel CPUs. The principles are (mostly) the same for other CPUs, although the details are different.</li>
</ul>
</div>
<br />
<hr />
<div style="text-align: justify;">
<span style="font-size: x-small;"><span style="font-size: x-small;">1. There are some randomized algorithms in <code>libquantum</code>, so I hardcoded a seed in order to measure the same work in each run: </span></span><br />
<pre class="code-snippet"><span style="font-size: x-small;">--- shor.c.orig 2007-09-03 08:47:55.000000000 +0200
+++ shor.c 2018-10-20 21:18:45.026076220 +0200
@@ -37,7 +37,7 @@
   int N;
   int c,q,a,b, factor;
 
-  srand(time(0));
+  srand(123456);
 
   if(argc == 1)
     {
</span></pre>
<span style="font-size: x-small;">The running time of this is stable (11.0 seconds with a standard deviation of 0.014).</span></div>
<div style="text-align: justify;">
<span style="font-size: x-small;"><span style="font-size: x-small;">2. It could also have made the compiler <a href="https://kristerw.blogspot.com/2018/08/inlining-and-microbenchmarking.html">make different inlining decisions </a>etc., but I have verified that the only difference for this program is that some code is placed on different addresses compared to the original version.</span> </span> <span style="font-size: x-small;"> </span>
</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com2tag:blogger.com,1999:blog-2420869991974935734.post-9428341403031341752018-08-11T20:58:00.001+02:002018-08-11T22:48:27.676+02:00Inlining and microbenchmarking<div style="text-align: justify;">
Inlining is important for C++ performance, but the compiler must be careful not to increase the code size too much. GCC’s inlining process basically inlines functions in priority order until a growth percentage limit is hit (or until all relevant functions have been inlined), which works fine for large translation units where the parts of the code not helped by inlining can compensate for the code increase in the parts where inlining helps. But it works less well for small translation units that need much inlining, which is common when doing microbenchmarks.<br />
<br />
<div style="text-align: justify;">
Take for example <a href="http://quick-bench.com/IrIpyXu7jJlQEtgDZjC0fm9HXlQ">this quick-bench benchmark</a> from Bartlomiej Filipek’s blog post “<a href="https://www.bfilipek.com/2018/07/string-view-perf-followup.html">Speeding Up <code>string_view</code> String Split Implementation</a>” that measures the performance for different ways of splitting a string.<br />
<br /></div>
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-9A9OonzLbq0/W2b2maBp8-I/AAAAAAAADzI/LVZwkwhcr-gp3_lT9D9nk2t0FmtcrFxdwCLcBGAs/s1600/IrIpyXu7jJlQEtgDZjC0fm9HXlQ.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="554" data-original-width="1108" height="320" src="https://4.bp.blogspot.com/-9A9OonzLbq0/W2b2maBp8-I/AAAAAAAADzI/LVZwkwhcr-gp3_lT9D9nk2t0FmtcrFxdwCLcBGAs/s640/IrIpyXu7jJlQEtgDZjC0fm9HXlQ.png" width="640" /></a></div>
<div style="text-align: justify;">
We can see that <code>StringSplitStd</code> is about 9% faster than <code>StringViewSplit</code> (they have the score 7829 and 7189), but the reason for this difference is that GCC has inlined everything for <code>StringSplitStd</code>, but not for <code>StringViewSplit</code>.<br />
<br />
We get a different result if we run the benchmark with a different set of functions. For example, removing <code>StringViewSplitStd</code> and <code>StringViewSplitPtr</code> from the benchmark makes the compiler make different inlining decisions, and we get the same performance for both <code>StringSplitStd</code> and <code>StringViewSplit</code> (<a href="http://quick-bench.com/ufVeHqj-R8fz67MZeSCKIJqVIiI" style="text-align: justify;">quick-bench</a>).</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-chdl3kHa_nA/W2bysC3X7dI/AAAAAAAADy8/ucqJTySmR8sao2tFVqsJAheFbM0FYqCjACLcBGAs/s1600/ufVeHqj-R8fz67MZeSCKIJqVIiI.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="554" data-original-width="1108" height="320" src="https://1.bp.blogspot.com/-chdl3kHa_nA/W2bysC3X7dI/AAAAAAAADy8/ucqJTySmR8sao2tFVqsJAheFbM0FYqCjACLcBGAs/s640/ufVeHqj-R8fz67MZeSCKIJqVIiI.png" width="640" /></a></div>
<div style="text-align: justify;">
It is a good idea when doing microbenchmarking to check that the compiler makes the same inlining decisions as when the code is used in a real use case (in a realistically sized translation unit).</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com1tag:blogger.com,1999:blog-2420869991974935734.post-66499584041626152912018-07-28T20:44:00.000+02:002018-07-28T21:56:37.029+02:00Don’t trust quick-bench results you see on the internet<div style="text-align: justify;">
It is easy to use <a href="http://quick-bench.com/">quick-bench</a> to benchmark small C++ functions, but it is hard to ensure they are measuring what is intended. Take <a href="http://quick-bench.com/SIfWognmCIljIGJ0578KVUMEe88">this</a> benchmark as an example. It is measuring different ways of transforming a string to lower case using code of the form</div>
<div style="text-align: justify;">
<pre class="code-snippet"><span style="color: #00aaaa;">void</span> <span style="color: #00aa00;">by_index</span>(<span style="color: #00aaaa;">char</span>* s, <span style="color: #00aaaa;">size_t</span> n)
{
  <span style="color: #0000aa;">for </span>(<span style="color: #00aaaa;">size_t</span> i = <span style="color: #009999;">0</span>; i < n; ++i)
    s[i] = tolower(s[i]);
}

<span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">void</span> <span style="color: #00aa00;">LowerByIndex</span>(benchmark::State&amp; state)
{
  <span style="color: #aaaaaa; font-style: italic;">// Code inside this loop is measured repeatedly</span>
  <span style="color: #0000aa;">for</span> (<span style="color: #0000aa;">auto</span> _ : state) {
    <span style="color: #00aaaa;">char</span> upper_string[] = <span style="color: #aa5500;">"UPPER TEXT"</span>;
    by_index(upper_string, <span style="color: #009999;">10</span>);
    <span style="color: #aaaaaa; font-style: italic;">// Make sure the variable is not optimized away by compiler</span>
    benchmark::DoNotOptimize(upper_string);
  }
}

<span style="color: #aaaaaa; font-style: italic;">// Register the function as a benchmark</span>
BENCHMARK(LowerByIndex);
</pre>
</div>
<div style="text-align: justify;">
We are going to look at how compilers may optimize it in ways that were probably not intended.<br />
<br />
<h3>
How a compiler may optimize the code</h3>
The following corresponds to how GCC optimizes this on Linux when using the glibc header files.<sup>1</sup><br />
<br />
The <code>tolower</code> function is defined in <code>/usr/include/ctype.h</code> as</div>
<pre class="code-snippet"><span style="color: #0000aa;">extern</span> <span style="color: #0000aa;">const</span> <span style="color: #00aaaa;">__int32_t</span> **<span style="color: #00aa00;">__ctype_tolower_loc</span> (<span style="color: #00aaaa;">void</span>)
     <span style="color: #0000aa;">throw</span> () __attribute__ ((__const__));

<span style="color: #0000aa;">extern</span> <span style="color: #0000aa;">__inline</span> <span style="color: #00aa00;">__attribute__</span> ((__gnu_inline__)) <span style="color: #00aaaa;">int</span>
tolower (<span style="color: #00aaaa;">int</span> __c) <span style="color: #0000aa;">throw</span> ()
{
  <span style="color: #0000aa;">return</span> __c >= -<span style="color: #009999;">128</span> && __c < <span style="color: #009999;">256</span> ? (*__ctype_tolower_loc())[__c] : __c;
}
</pre>
This is inlined into <code>by_index</code>, which in turn is inlined into <code>LowerByIndex</code><br />
<pre class="code-snippet"><span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">void</span> LowerByIndex(benchmark::State&amp; state)
{
  <span style="color: #aaaaaa; font-style: italic;">// Code inside this loop is measured repeatedly</span>
  <span style="color: #0000aa;">for</span> (<span style="color: #0000aa;">auto</span> _ : state) {
    <span style="color: #00aaaa;">char</span> upper_string[] = <span style="color: #aa5500;">"UPPER TEXT"</span>;
    <span style="color: #0000aa;">for </span>(<span style="color: #00aaaa;">size_t</span> i = <span style="color: #009999;">0</span>; i < <span style="color: #009999;">10</span>; ++i) {
      <span style="color: #00aaaa;">int</span> __c = upper_string[i];
      upper_string[i] = __c >= -<span style="color: #009999;">128</span> && __c < <span style="color: #009999;">256</span> ? (*__ctype_tolower_loc())[__c] : __c;
    }
    benchmark::DoNotOptimize(upper_string);
  }
}
</pre>
<div style="text-align: justify;">
<br />
A <code>char</code> has values in the range -128 to 127 when compiling for the x86 architecture, so the compiler determines that the comparisons in</div>
<pre class="code-snippet" style="text-align: justify;">upper_string[i] = __c >= -<span style="color: #009999;">128</span> && __c < <span style="color: #009999;">256</span> ? (*__ctype_tolower_loc())[__c] : __c;
</pre>
<div style="text-align: justify;">
are always true (as <code>__c</code> is assigned from a <code>char</code>), and the line is simplified to</div>
<pre class="code-snippet">upper_string[i] = (*__ctype_tolower_loc())[__c];
</pre>
<br />
<div style="text-align: justify;">
The <code>__ctype_tolower_loc</code> function is decorated with the <a href="https://kristerw.blogspot.com/2016/12/gcc-attributepure-and-c-exceptions.html"><code>const</code> attribute</a> so the function call can be moved out of the loop, provided the loop body is always executed. Compilers typically represent for-loops as an if-statement followed by a do-while loop – for example</div>
<pre class="code-snippet"><span style="color: #0000aa;">for</span> (i = <span style="color: #009999;">0</span>; i < n; ++i) {
  do_something();
}
</pre>
is represented as<br />
<pre class="code-snippet">i = <span style="color: #009999;">0</span>;
<span style="color: #0000aa;">if</span> (i < n) {
  <span style="color: #0000aa;">do</span> {
    do_something();
    ++i;
  } <span style="color: #0000aa;">while</span> (i < n);
}
</pre>
<div style="text-align: justify;">
This representation simplifies the work for the compiler as the pre-condition is separated from the actual loop, and loop-invariant expressions can now trivially be moved out of the loop. The if-statement simplifies to always true (or always false) when the iteration count is known at compile time. Rewriting the loops and moving the call to <code>__ctype_tolower_loc</code> out of them gives us the result</div>
<pre class="code-snippet"><span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">void</span> LowerByIndex(benchmark::State&amp; state)
{
  <span style="color: #0000aa;">auto</span> it = state.begin();
  <span style="color: #0000aa;">if</span> (it != state.end()) {
    <span style="color: #00aaaa;">int</span> *p = *__ctype_tolower_loc();
    <span style="color: #aaaaaa; font-style: italic;">// Code inside this loop is measured repeatedly</span>
    <span style="color: #0000aa;">do</span> {
      <span style="color: #00aaaa;">char</span> upper_string[] = <span style="color: #aa5500;">"UPPER TEXT"</span>;
      <span style="color: #00aaaa;">size_t</span> i = <span style="color: #009999;">0</span>;
      <span style="color: #0000aa;">do</span> {
        <span style="color: #00aaaa;">int</span> __c = upper_string[i];
        upper_string[i] = p[__c];
        ++i;
      } <span style="color: #0000aa;">while</span> (i < 10);
      benchmark::DoNotOptimize(upper_string);
      ++it;
    } <span style="color: #0000aa;">while</span> (it != state.end());
  }
}
</pre>
<div style="text-align: justify;">
<i>Note that the call to <code>__ctype_tolower_loc</code> is now outside of the code segment being measured!</i></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The inner loop is small, so it is fully unrolled</div>
<pre class="code-snippet"><span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">void</span> LowerByIndex(benchmark::State&amp; state) {
<span style="color: #0000aa;">auto</span> it = state.begin();
<span style="color: #0000aa;">if</span> (it != state.end()) {
<span style="color: #00aaaa;">int</span> *p = *__ctype_tolower_loc();
<span style="color: #aaaaaa; font-style: italic;">// Code inside this loop is measured repeatedly</span>
<span style="color: #0000aa;">do</span> {
<span style="color: #00aaaa;">char</span> upper_string[] = <span style="color: #aa5500;">"UPPER TEXT"</span>;
<span style="color: #00aaaa;">int</span> __c0 = upper_string[<span style="color: #009999;">0</span>];
upper_string[<span style="color: #009999;">0</span>] = p[__c0];
<span style="color: #00aaaa;">int</span> __c1 = upper_string[<span style="color: #009999;">1</span>];
upper_string[<span style="color: #009999;">1</span>] = p[__c1];
<span style="color: #00aaaa;">int</span> __c2 = upper_string[<span style="color: #009999;">2</span>];
upper_string[<span style="color: #009999;">2</span>] = p[__c2];
<span style="color: #00aaaa;">int</span> __c3 = upper_string[<span style="color: #009999;">3</span>];
upper_string[<span style="color: #009999;">3</span>] = p[__c3];
<span style="color: #00aaaa;">int</span> __c4 = upper_string[<span style="color: #009999;">4</span>];
upper_string[<span style="color: #009999;">4</span>] = p[__c4];
<span style="color: #00aaaa;">int</span> __c5 = upper_string[<span style="color: #009999;">5</span>];
upper_string[<span style="color: #009999;">5</span>] = p[__c5];
<span style="color: #00aaaa;">int</span> __c6 = upper_string[<span style="color: #009999;">6</span>];
upper_string[<span style="color: #009999;">6</span>] = p[__c6];
<span style="color: #00aaaa;">int</span> __c7 = upper_string[<span style="color: #009999;">7</span>];
upper_string[<span style="color: #009999;">7</span>] = p[__c7];
<span style="color: #00aaaa;">int</span> __c8 = upper_string[<span style="color: #009999;">8</span>];
upper_string[<span style="color: #009999;">8</span>] = p[__c8];
<span style="color: #00aaaa;">int</span> __c9 = upper_string[<span style="color: #009999;">9</span>];
upper_string[<span style="color: #009999;">9</span>] = p[__c9];
benchmark::DoNotOptimize(upper_string);
++it;
} <span style="color: #0000aa;">while</span> (it != state.end());
}
}</pre>
<div style="text-align: justify;">
All accesses to <code>upper_string</code> are now to known positions, so the compiler can easily forward the values written to where they are read, and the generated code does not need to initialize or read from <code>upper_string</code></div>
<pre class="code-snippet"><span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">void</span> LowerByIndex(benchmark::State&amp; state) {
<span style="color: #0000aa;">auto</span> it = state.begin();
<span style="color: #0000aa;">if</span> (it != state.end()) {
<span style="color: #00aaaa;">int</span> *p = *__ctype_tolower_loc();
<span style="color: #aaaaaa; font-style: italic;">// Code inside this loop is measured repeatedly</span>
<span style="color: #0000aa;">do</span> {
<span style="color: #00aaaa;">char</span> upper_string[<span style="color: #009999;">10</span>];
upper_string[<span style="color: #009999;">0</span>] = p[<span style="color: #aa5500;">'U'</span>];
upper_string[<span style="color: #009999;">1</span>] = p[<span style="color: #aa5500;">'P'</span>];
upper_string[<span style="color: #009999;">2</span>] = p[<span style="color: #aa5500;">'P'</span>];
upper_string[<span style="color: #009999;">3</span>] = p[<span style="color: #aa5500;">'E'</span>];
upper_string[<span style="color: #009999;">4</span>] = p[<span style="color: #aa5500;">'R'</span>];
upper_string[<span style="color: #009999;">5</span>] = p[<span style="color: #aa5500;">' '</span>];
upper_string[<span style="color: #009999;">6</span>] = p[<span style="color: #aa5500;">'T'</span>];
upper_string[<span style="color: #009999;">7</span>] = p[<span style="color: #aa5500;">'E'</span>];
upper_string[<span style="color: #009999;">8</span>] = p[<span style="color: #aa5500;">'X'</span>];
upper_string[<span style="color: #009999;">9</span>] = p[<span style="color: #aa5500;">'T'</span>];
benchmark::DoNotOptimize(upper_string);
++it;
} <span style="color: #0000aa;">while</span> (it != state.end());
}
}
</pre>
<div style="text-align: justify;">
Finally, several characters are the same, so the compiler can CSE (Common Subexpression Elimination) them to only load these values once from <code>p</code></div>
<pre class="code-snippet"><span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">void</span> LowerByIndex(benchmark::State&amp; state)
{
<span style="color: #0000aa;">auto</span> it = state.begin();
<span style="color: #0000aa;">if</span> (it != state.end()) {
<span style="color: #00aaaa;">int</span> *p = *__ctype_tolower_loc();
<span style="color: #aaaaaa; font-style: italic;">// Code inside this loop is measured repeatedly</span>
<span style="color: #0000aa;">do</span> {
<span style="color: #00aaaa;">char</span> upper_string[<span style="color: #009999;">10</span>];
upper_string[<span style="color: #009999;">0</span>] = p[<span style="color: #aa5500;">'U'</span>];
upper_string[<span style="color: #009999;">1</span>] = upper_string[<span style="color: #009999;">2</span>] = p[<span style="color: #aa5500;">'P'</span>];
upper_string[<span style="color: #009999;">3</span>] = upper_string[<span style="color: #009999;">7</span>] = p[<span style="color: #aa5500;">'E'</span>];
upper_string[<span style="color: #009999;">4</span>] = p[<span style="color: #aa5500;">'R'</span>];
upper_string[<span style="color: #009999;">5</span>] = p[<span style="color: #aa5500;">' '</span>];
upper_string[<span style="color: #009999;">6</span>] = upper_string[<span style="color: #009999;">9</span>] = p[<span style="color: #aa5500;">'T'</span>];
upper_string[<span style="color: #009999;">8</span>] = p[<span style="color: #aa5500;">'X'</span>];
benchmark::DoNotOptimize(upper_string);
++it;
} <span style="color: #0000aa;">while</span> (it != state.end());
}
}</pre>
<div style="text-align: justify;">
That is, the compiler has used its knowledge of the input to basically hard code the result (this may be OK, depending on what the benchmark is trying to measure). And it has moved some code out of the code segment being measured (which is probably not OK for the benchmark).</div>
<br />
<h3>
How to fix the benchmark</h3>
<div>
</div>
<div style="text-align: justify;">
I usually prefer keeping the benchmarked function in a separate translation unit in order to guarantee that the compiler cannot take advantage of the code setting up the benchmark, but that does not work in quick-bench. One way to get a similar effect is to mark the function as <code>noinline</code>, but that only solves part of the problem – compilers do various <a href="https://kristerw.blogspot.com/2017/05/interprocedural-optimization-in-gcc.html">interprocedural optimizations</a>, and for GCC you should specify at least <code>noclone</code> too. Other compilers may need to be restricted in different ways.</div>
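For GCC, the attributes can be combined as in the following sketch (my illustration, not code from the benchmark; the function name and body are made up, and the <code>noipa</code> attribute mentioned in the comment was added in GCC 8):

```cpp
#include <cctype>
#include <cstddef>

// Sketch: forbid both inlining and cloning so GCC cannot specialize
// the benchmarked function for the constant arguments it sees at the
// call site. (Since GCC 8, __attribute__((noipa)) disables all
// interprocedural analyses in one go.)
__attribute__((noinline, noclone))
void to_lower(char *s, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        s[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(s[i])));
}
```

With this, the call in the measured loop stays a real call, and the loads from the translation table cannot be hoisted out of the benchmarked region based on knowledge of the caller.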
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
It may also be possible to hide information from the compiler by using <code>volatile</code> or functionality from the benchmarking framework (such as <code>benchmark::DoNotOptimize</code> and <code>benchmark::ClobberMemory</code>), but this may also introduce unintended behavior. For example, these workarounds make the code look “unusual” to the compiler, which may cause various heuristics to make different optimization decisions compared to normal usage.</div>
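As an illustration of the <code>volatile</code> approach (my sketch, not code from the post), reading the input through a volatile lvalue forces a real load, so the optimizer can no longer treat the buffer contents as compile-time constants:

```cpp
#include <cstddef>

// Sketch: each element is read through a volatile lvalue, so the
// compiler must emit the load and cannot constant-fold the contents,
// even when the caller passes a string literal it can see.
int checksum_opaque(const char *s, std::size_t n)
{
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += *(const volatile char *)&s[i];  // opaque load
    return sum;
}
```

The downside is exactly the one described above: the volatile accesses also prevent vectorization and other legitimate optimizations, so the measured code is no longer the code that runs in production.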
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In general, you need to spend some time analyzing the benchmark in order to determine what the result means (for example, are we measuring the difference in how fast different methods can transform a string, or are we only measuring the difference for the string “UPPER TEXT”?), or as Fabian Giesen says in “<a href="https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/">A whirlwind introduction to dataflow graphs</a>”</div>
<blockquote class="tr_bq">
<div style="text-align: justify;">
With microbenchmarks, like a trial lawyer during cross-examination, <i>you should never ask a question you don’t know the answer to</i> (or at least have a pretty good idea of what it is). Real-world systems are generally too complex and intertwined to understand from surface measurements alone. If you have no idea how a system works at all, you don’t know what the right questions are, nor how to ask them, and any answers you get will be opaque at best, if not outright garbage. Microbenchmarks are a useful tool to confirm that an existing model is a good approximation to reality, but not very helpful in building these models to begin with.</div>
</blockquote>
<br />
<hr />
<div style="text-align: justify;">
<span style="font-size: x-small;">1. The libstdc++ headers use a normal function call for <code>tolower</code> instead of using the inlined version from <code>ctype.h</code>. You can see the optimization from this blog post using libstdc++ too by including <code>ctype.h</code> before any C++ header (but this is not possible in quick-bench, as it adds its own headers before the user code).</span>
</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com3tag:blogger.com,1999:blog-2420869991974935734.post-41248104181057245642018-06-25T13:40:00.000+02:002018-06-25T13:40:19.695+02:00Useful GCC address sanitizer checks not enabled by default<div style="text-align: justify;">
Some useful address sanitizer checks are disabled by default because they are relatively expensive (or, as with the <code>std::vector</code> checking, because they need to be enabled for all translation units).<br />
<br /></div>
<h3 style="text-align: justify;">
Use after return</h3>
<div style="text-align: justify;">
The address sanitizer warns when a variable is used after it has gone out of scope in a function, but it does not warn when a variable is used after the function has returned. That can, however, be enabled by adding <code>detect_stack_use_after_return=1</code> to the <code>ASAN_OPTIONS</code> environment variable.<br />
<br />
<h4>
Example</h4>
<pre class="code-snippet"><span style="color: #00aaaa;">int</span> *ptr;
__attribute__((noinline))
<span style="color: #00aaaa;">void</span> foo(<span style="color: #00aaaa;">void</span>)
{
<span style="color: #00aaaa;">int</span> a;
ptr = &amp;a;
}
<span style="color: #00aaaa;">int</span> main(<span style="color: #00aaaa;">void</span>)
{
foo();
<span style="color: #0000aa;">return</span> *ptr; <span style="color: #aaaaaa; font-style: italic;">// Error</span>
}
</pre>
Compile as<br />
<pre class="code-snippet">gcc -O -fsanitize=address file.c
</pre>
and add <code>detect_stack_use_after_return=1</code> to the <code>ASAN_OPTIONS</code> environment variable before running the program<br />
<pre class="code-snippet">env ASAN_OPTIONS="detect_stack_use_after_return=1" ./a.out
</pre>
</div>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
Pointer comparison</h3>
<div style="text-align: justify;">
It is not valid to compare two pointers from different objects using the relational operators <code><</code>, <code><=</code>, <code>></code>, and <code>>=</code>. This can be detected by compiling with <code>-fsanitize=address,pointer-compare</code> and adding <code>detect_invalid_pointer_pairs=1</code> to the <code>ASAN_OPTIONS</code> environment variable.<br />
<br />
<i>Note: <code>-fsanitize=pointer-compare</code> was added in GCC 8.</i><br />
<br />
<h4>
Example</h4>
<pre class="code-snippet"><span style="color: #4c8317;">#include <stdlib.h></span>
<span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">main</span>(<span style="color: #00aaaa;">void</span>)
{
<span style="color: #00aaaa;">char</span> *p = malloc(<span style="color: #009999;">42</span>);
<span style="color: #00aaaa;">char</span> *q = malloc(<span style="color: #009999;">42</span>);
<span style="color: #00aaaa;">int</span> tmp = p < q; <span style="color: #aaaaaa; font-style: italic;">// Error</span>
free(p);
free(q);
<span style="color: #0000aa;">return</span> tmp;
}
</pre>
Compile as</div>
<div style="text-align: justify;">
<pre class="code-snippet">gcc -fsanitize=address,pointer-compare file.c
</pre>
and add <code>detect_invalid_pointer_pairs=1</code> to the <code>ASAN_OPTIONS</code> environment variable before running the program<br />
<pre class="code-snippet">env ASAN_OPTIONS="detect_invalid_pointer_pairs=1" ./a.out
</pre>
</div>
<div>
<br /></div>
<h3 style="text-align: justify;">
Pointer subtraction</h3>
<div style="text-align: justify;">
It is not valid to subtract pointers that point into different objects. This can be detected by compiling with <code><span style="white-space: nowrap;">-fsanitize</span>=address,pointer-subtract</code> and adding <code>detect_invalid_pointer_pairs=1</code> to the <code>ASAN_OPTIONS</code> environment variable.<br />
<br />
<i>Note: <code>-fsanitize=pointer-subtract</code> was added in GCC 8.</i><br />
<br />
<h4>
Example</h4>
<pre class="code-snippet"><span style="color: #4c8317;">#include <stdlib.h></span>
<span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">main</span>(<span style="color: #00aaaa;">void</span>)
{
<span style="color: #00aaaa;">char</span> *p = malloc(<span style="color: #009999;">42</span>);
<span style="color: #00aaaa;">char</span> *q = malloc(<span style="color: #009999;">42</span>);
<span style="color: #00aaaa;">int</span> tmp = p - q; <span style="color: #aaaaaa; font-style: italic;">// Error</span>
free(p);
free(q);
<span style="color: #0000aa;">return</span> tmp;
}
</pre>
Compile as</div>
<div style="text-align: justify;">
<pre class="code-snippet">gcc -O -fsanitize=address,pointer-subtract file.c
</pre>
and add <code>detect_invalid_pointer_pairs=1</code> to the <code>ASAN_OPTIONS</code> environment variable before running the program<br />
<pre class="code-snippet">env ASAN_OPTIONS="detect_invalid_pointer_pairs=1" ./a.out
</pre>
</div>
<div>
<br /></div>
<h3 style="text-align: justify;">
std::vector checking</h3>
<div style="text-align: justify;">
The address sanitizer does not detect out-of-bounds accesses to the unused capacity of a vector, such as<br />
<pre class="code-snippet">std::vector<<span style="color: #00aaaa;">int</span>> v(<span style="color: #009999;">2</span>);
<span style="color: #00aaaa;">int</span>* p = v.data();
v.pop_back();
<span style="color: #0000aa;">return</span> p[<span style="color: #009999;">1</span>]; <span style="color: #aaaaaa; font-style: italic;">// Error</span>
</pre>
because the memory is valid, even though it is an error to use it. It is possible to make the address sanitizer warn for this by compiling with <code>-D_GLIBCXX_SANITIZE_VECTOR</code> which makes libstdc++ annotate the memory so that the validity can be tracked. The annotations must be present on all vector operations or none, so this macro must be defined to the same value for all translation units that create, destroy or modify vectors.<br />
<br />
<i>Note: <code>_GLIBCXX_SANITIZE_VECTOR</code> was added in the GCC 8 libstdc++.</i><br />
<br />
<h4>
Example</h4>
<pre class="code-snippet"><span style="color: #4c8317;">#include <vector></span>
<span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">main</span>()
{
std::vector<<span style="color: #00aaaa;">int</span>> v(<span style="color: #009999;">2</span>);
<span style="color: #00aaaa;">int</span>* p = v.data();
v.pop_back();
<span style="color: #0000aa;">return</span> p[<span style="color: #009999;">1</span>]; <span style="color: #aaaaaa; font-style: italic;">// Error</span>
}
</pre>
Compile as</div>
<div style="text-align: justify;">
<pre class="code-snippet">g++ -O -fsanitize=address -D_GLIBCXX_SANITIZE_VECTOR file.cpp
</pre>
</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com2tag:blogger.com,1999:blog-2420869991974935734.post-60800155400824483192018-06-10T12:25:00.000+02:002018-06-10T17:48:27.988+02:00On an example from “What Else Has My Compiler Done For Me Lately?”<div style="text-align: justify;">
One of the examples in Matt Godbolt’s C++Now 2018 talk “<a href="https://www.youtube.com/watch?v=nAbCKa0FzjQ">What Else Has My Compiler Done For Me Lately?</a>” is the function<br />
<pre class="code-snippet"><span style="color: #00aaaa;">void</span> <span style="color: #00aa00;">maxArray</span>(<span style="color: #00aaaa;">double</span> * __restrict x, <span style="color: #00aaaa;">double</span> * __restrict y)
{
<span style="color: #0000aa;">for</span> (<span style="color: #00aaaa;">int</span> i = <span style="color: #009999;">0</span>; i < <span style="color: #009999;">65536</span>; i++) {
<span style="color: #0000aa;">if</span> (y[i] > x[i])
x[i] = y[i];
}
}
</pre>
The compiler generates vectorized code that processes four elements at a time – it reads the elements from <code>x</code> and <code>y</code>, compares the elements, and uses the result of the comparison as a mask in a masked move to write the elements from <code>y</code> that are larger than the corresponding element from <code>x</code> <br />
<pre class="code-snippet">vmovupd ymm0, ymmword ptr [rsi + rax]
vmovupd ymm4, ymmword ptr [rdi + rax]
vcmpltpd ymm4, ymm4, ymm0
vmaskmovpd ymmword ptr [rdi + rax], ymm4, ymm0
</pre>
Modifying <code>maxArray</code> to use a more max-like construct (or <code>std::max</code>) as in<br />
<pre class="code-snippet"><span style="color: #00aaaa;">void</span> <span style="color: #00aa00;">maxArray2</span>(<span style="color: #00aaaa;">double</span> * __restrict x, <span style="color: #00aaaa;">double</span> * __restrict y)
{
<span style="color: #0000aa;">for</span> (<span style="color: #00aaaa;">int</span> i = <span style="color: #009999;">0</span>; i < <span style="color: #009999;">65536</span>; i++) {
x[i] = (y[i] > x[i]) ? y[i] : x[i];
}
}
</pre>
makes the compiler generate this using a “max” instruction instead of the compare and masked move<br />
<pre class="code-snippet">vmovupd ymm0, ymmword ptr [rsi + rax]
vmaxpd ymm0, ymm0, ymmword ptr [rdi + rax]
vmovupd ymmword ptr [rdi + rax], ymm0
</pre>
<br />
Matt says he is a bit surprised that the compiler cannot see that the first version, too, can be generated in this way, but the compiler is doing the right thing – it is not allowed to transform <code>maxArray</code> like that! The reason is that <code>maxArray</code> only writes to <code>x</code> when the value changes, while <code>maxArray2</code> always writes to <code>x</code>, and the compiler would introduce problems if the generated code contained stores that are not present in the original source code. Consider, for example, the program<br />
<pre class="code-snippet"><span style="color: #0000aa;">const</span> <span style="color: #00aaaa;">double</span> a1[<span style="color: #009999;">65536</span>] = {<span style="color: #009999;">0.0</span>};
<span style="color: #00aaaa;">double</span> a2[<span style="color: #009999;">65536</span>] = {<span style="color: #009999;">0.0</span>};
<span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">main</span>(<span style="color: #00aaaa;">void</span>)
{
maxArray((<span style="color: #00aaaa;">double</span>*)a1, a2);
<span style="color: #0000aa;">return</span> <span style="color: #009999;">0</span>;
}
</pre>
that is passing a constant array to <code>maxArray</code>. It is valid to cast away <code>const</code> as long as the object is not written to through the pointer, so this program is correct – <code>y[i]</code> is never bigger than <code>x[i]</code> for any <code>i</code>, so <code>maxArray</code> will never write to <code>x</code> (the mask in the vectorized code is never set, so the <code>vmaskmovpd</code> instruction is essentially a <code>nop</code>). The code from <code>maxArray2</code> does, however, always write to <code>x</code> so it would crash on this input as the compiler places <code>a1</code> in read-only memory.</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com1tag:blogger.com,1999:blog-2420869991974935734.post-2305241566526395072018-04-12T15:02:00.001+02:002018-04-12T20:18:13.072+02:00Compilation time – -std=c++98 vs. -std=c++11<div style="text-align: justify;">
I saw <a href="https://twitter.com/matiasgoldberg/status/982315205534330880">on twitter</a> that it takes more than twice as much time compiling the <a href="https://www.ogre3d.org/">Ogre</a> graphics engine using GCC than when using Clang. My experience is that GCC and Clang usually compile with similar speed, so I decided to look into why compiling Ogre is different.<br />
<br />
It turned out that a big part of the difference comes from which C++ version the compilers use per default – <span style="white-space: nowrap;"><code>clang-5.0</code></span> defaults to C++98, while <span style="white-space: nowrap;"><code>gcc-7</code></span> defaults to a newer version. Forcing the compilers to use C++98 by passing <span style="white-space: nowrap;"><code>-std=c++98</code></span> makes GCC compile Ogre in about half the time (668s vs. 1135s), while passing <span style="white-space: nowrap;"><code>-std=c++11</code></span> nearly doubles the time Clang needs to compile it!<br />
<br />
One reason for this difference is that some of the standard include files are more expensive in C++11 mode as they suck in more dependencies. For example, compiling a file containing just the line<br />
<pre class="code-snippet">#include <memory>
</pre>
takes 0.16 seconds on my computer when using C++11<br />
<pre class="code-snippet">> time -p g++ -O2 -c test.cpp -std=c++11
real 0.16
user 0.14
sys 0.01
</pre>
while compiling it as C++98 is faster<br />
<pre class="code-snippet">> time -p g++ -O2 -c test.cpp -std=c++98
real 0.02
user 0.01
sys 0.01
</pre>
The 0.14-second difference may not seem that big, but it makes a difference when, as for Ogre, you are compiling more than 500 files, each taking about one second. The increased cost of including standard header files for C++11 compared to C++98 adds about 20% to the Ogre build time.<br />
<br />
It is a bit unclear to me exactly where the rest of the slowdown comes from, but it seems to be spread all over the code (I tried to remove various classes in the Ogre code base, and removing 10% of the source code seems to affect both the fast and slow version by about 10%) so I assume this is just because the templates in the C++11 STL are more complex and the compiler needs to work a bit harder each time they are used...<br />
<br />
Anyway, the difference in compilation time between <span style="white-space: nowrap;"><code>-std=c++98</code></span> and <span style="white-space: nowrap;"><code>-std=c++11</code></span> was much bigger than I had guessed, and I’ll now ensure I use <span style="white-space: nowrap;"><code>-std=c++98</code></span> when building C++98 code.<br />
<br />
<br />
<i>Updated: The original blog post said that <code>gcc-7</code> uses C++11 per default. That was wrong, it defaults to C++14.</i></div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com3tag:blogger.com,1999:blog-2420869991974935734.post-47418901640704554592018-03-03T21:43:00.000+01:002018-03-03T21:43:42.101+01:00Detecting incorrect C++ STL usage<div style="text-align: justify;">
The GCC and LLVM sanitizers are great for finding problems in C++ code, but they do not detect problems with incorrect STL usage. For example, <span style="white-space: nowrap;"><code>std::list::merge</code></span> merges two sorted lists, so the following program is incorrect (as <code>list2</code> is not sorted)</div>
<div style="text-align: justify;">
<pre class="code-snippet"><span style="color: #4c8317;">#include <iostream></span>
<span style="color: #4c8317;">#include <list></span>
<span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">main</span>()
{
std::list<<span style="color: #00aaaa;">int</span>> list1 = {<span style="color: #009999;">1</span>, <span style="color: #009999;">2</span>, <span style="color: #009999;">3</span>, <span style="color: #009999;">4</span>};
std::list<<span style="color: #00aaaa;">int</span>> list2 = {<span style="color: #009999;">5</span>, <span style="color: #009999;">3</span>, <span style="color: #009999;">4</span>, <span style="color: #009999;">2</span>};
list1.merge(list2);
<span style="color: #0000aa;">for</span> (<span style="color: #0000aa;">auto</span> x : list1)
std::cout << x << <span style="color: #aa5500;">'\n'</span>;
}
</pre>
but this cannot be detected by the sanitizers. The libstdc++ library does, however, have a “debug mode” that can be used to detect this kind of problem. The debug mode is enabled by passing <span style="white-space: nowrap;"><code>-D_GLIBCXX_DEBUG</code></span> to the compiler<br />
<pre class="code-snippet">g++ -O2 -D_GLIBCXX_DEBUG example.cpp
</pre>
which enables assertions checking the preconditions, and the program fails at runtime<br />
<pre class="code-snippet">/scratch/gcc-7.2.0/install/include/c++/7.2.0/debug/list:716:
Error: elements in iterator range [__x.begin().base(), __x.end().base())
are not sorted.
Objects involved in the operation:
iterator "__x.begin().base()" @ 0x0x7fff1754ea10 {
type = std::__cxx1998::_List_iterator&lt;int&gt;;
}
iterator "__x.end().base()" @ 0x0x7fff1754ea40 {
type = std::__cxx1998::_List_iterator&lt;int&gt;;
}
Abort (core dumped)</pre>
<br />
There are a few constructs that are invalid according to the C++ standard but that libstdc++ handles as an extension. One such example is inserting a range of a list into the list the range points to, as in<br />
<pre class="code-snippet"><span style="color: #4c8317;">#include <iostream></span>
<span style="color: #4c8317;">#include <list></span>
<span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">main</span>()
{
std::list<<span style="color: #00aaaa;">int</span>> list1 = {<span style="color: #009999;">1</span>, <span style="color: #009999;">2</span>, <span style="color: #009999;">3</span>, <span style="color: #009999;">4</span>};
list1.insert(list1.begin(), list1.begin(), list1.end());
<span style="color: #0000aa;">for</span> (<span style="color: #0000aa;">auto</span> x : list1)
std::cout << x << <span style="color: #aa5500;">'\n'</span>;
}</pre>
The debug mode does not report errors for these extensions when using <span style="white-space: nowrap;"><code>-D_GLIBCXX_DEBUG</code></span>, but the debug mode can be made more pedantic by adding <span style="white-space: nowrap;"><code>-D_GLIBCXX_DEBUG_PEDANTIC</code></span><br />
<pre class="code-snippet">g++ -O2 -D_GLIBCXX_DEBUG -D_GLIBCXX_DEBUG_PEDANTIC example.cpp
</pre>
which reports errors for these too.<br />
<br />
One annoyance with the debug mode is that it changes the size of some standard class templates, so you cannot pass containers between translation units compiled with and without debug mode – this often means that you need to build the whole application with debug mode enabled.<br />
<br />
<i>The libstdc++ debug mode was introduced in GCC 3.4.</i></div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com1tag:blogger.com,1999:blog-2420869991974935734.post-19889289479353103562018-02-04T19:45:00.000+01:002018-02-04T21:13:33.174+01:00GCC command options for debugging – -Og and -g3<div style="text-align: justify;">
I listened to a <a href="http://cppcast.com/2018/01/balazs-torok/">recent CppCast episode</a> where Balázs Török talked about game development and mentioned that C++ abstractions make the code unusably slow in debug builds, and that he is skeptical of the debuggability of meta-classes, as today's debuggers cannot even handle macros. I think both problems are solvable, and I would argue that GCC is already handling at least the first issue, provided the right command options are used.<br />
<br />
<h3>
Debugging optimized code</h3>
GCC can generate debug information when optimizing, so it is possible to run fully optimized code in the debugger. But this is not too useful in reality – many optimizations change the structure of the code, so it is often impossible to single-step through the resulting binary, as instructions from different parts of the program are interleaved...<br />
<br />
The GCC developers have traditionally tried to limit the damage done by the optimizers for the <span style="white-space: nowrap;"><code>-O1</code></span> optimization level, but <span style="white-space: nowrap;"><code>-O1</code></span> is often used for release builds too, so there is a limit to how many optimizations can be disabled without annoying too many developers – a new optimization level, <span style="white-space: nowrap;"><code>-Og</code></span>, was therefore introduced in GCC 4.8. The <span style="white-space: nowrap;"><code>-Og</code></span> optimization level enables the optimizations that do not interfere with debugging, and it may even result in a better debugging experience than when compiling without optimizations, as some optimization passes collect information useful for generating better debug information.<br />
<br />
The difference between <span style="white-space: nowrap;"><code>-O1</code></span> and <span style="white-space: nowrap;"><code>-Og</code></span> is that<br />
<ul>
<li><code>-Og</code> disables some optimizations, such as if-conversion, that simplifies control flow, so the structure of the generated code is roughly the same as in the source code.</li>
<li><code>-Og</code> disables some passes, such as <span style="white-space: nowrap;"><code>-ftree-pta</code></span> and <span style="white-space: nowrap;"><code>-ftree-sra</code></span>, that help other optimization passes by propagating information about memory accesses throughout the functions. The effect of this is that those passes now optimize with only the information available locally within each basic block.</li>
<li><code>-Og</code> is less aggressive in the back end peephole optimizations, so each generated instruction is less likely to execute functionality from several statements in the source code.</li>
</ul>
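As a small example of the first point (my own sketch, not from any particular code base): at <code>-O1</code>, GCC will typically if-convert a function like the one below into a branchless conditional-move sequence, while <code>-Og</code> keeps the branch, so a breakpoint on the assignment behaves as you would expect from the source.

```cpp
// Sketch: at -O1 the branch is usually turned into a cmov-style
// sequence with no separate basic block for the assignment; at -Og
// the if keeps its own basic block, matching the source structure
// that the debugger presents.
int clamp_to_zero(int x)
{
    if (x < 0)
        x = 0;
    return x;
}
```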
How much this affects the performance depends a lot on coding style etc., but modern CPUs are great at hiding inefficiencies using branch-prediction, speculative execution, and store to load forwarding, so the difference between <span style="white-space: nowrap;"><code>-Og</code></span> and fully optimized code is often surprisingly small, even when using the STL.<br />
<br />
<h3>
Debug information for macros</h3>
GCC does not put information about macros in the debug information per default, but it is possible to add it by passing <span style="white-space: nowrap;"><code>-g3</code></span> to the compiler. This makes GDB aware of the macros and <a href="https://sourceware.org/gdb/onlinedocs/gdb/Macros.html">enables some macro-related commands</a>. But the debugging experience is still not that great. I do not understand why better support has not been implemented – possibly because <a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/Inline.html#Inline">inline functions should be used instead of macros</a>...</div>
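A minimal example (mine, not from the GDB documentation): compiling the file below with <code>g++ -g3</code> records the macro definition in the debug information, so GDB commands such as <code>info macro SQUARE</code> and <code>macro expand SQUARE(x + 1)</code> work while stopped inside <code>square_plus_one</code>.

```cpp
// Hypothetical example macro; with -g3 in the debug info, GDB can
// show its definition and expand uses of it at the current source
// location. With plain -g, GDB knows nothing about SQUARE.
#define SQUARE(x) ((x) * (x))

int square_plus_one(int x)
{
    return SQUARE(x) + 1;
}
```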
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com0tag:blogger.com,1999:blog-2420869991974935734.post-21928833009069634602018-01-21T12:35:00.000+01:002018-01-21T12:35:38.933+01:00GCC back end performance tuning<div style="text-align: justify;">
<i>This is part six of a series “<a href="https://kristerw.blogspot.se/2017/08/writing-gcc-backend_4.html">Writing a GCC back end</a>”.</i><br />
<i><br /></i></div>
<div style="text-align: justify;">
<h3>
Cost model – <code>TARGET_RTX_COSTS</code></h3>
The compiler often has different options for how it can optimize and emit the code. For example, dividing a 32-bit integer by the constant value <code>3</code><br />
<pre class="code-snippet">i = i / <span style="color: #009999;">3</span>;
</pre>
can be generated as a division instruction, but <a href="https://kristerw.blogspot.se/2017/05/integer-division-is-slow.html">division instructions are slow</a>, so it may be better to generate this as the equivalent of<br />
<pre class="code-snippet">i = (((<span style="color: #00aaaa;">int64_t</span>)i * <span style="color: #009999;">0x55555556</span>) >> <span style="color: #009999;">32</span>) - (i >> <span style="color: #009999;">31</span>);
</pre>
which gives the same result. GCC decides which to generate by preparing RTL for the different alternatives, and queries the target’s cost function (if implemented) to determine which is the cheapest. Which alternatives are tried depends on which insns are defined in the target description, and what constraints the insns have, but the compiler will, in this case, ask for the costs of subexpressions of the form<br />
<pre class="code-snippet">(truncate:SI (lshiftrt:DI (mult:DI (sign_extend:DI (reg:SI 88))
(const_int 0x55555556))
(const_int 32)))
</pre>
and compares their combined costs with<br />
<pre class="code-snippet">(div:SI (reg:SI 88) (const_int 3))
</pre>
<br />
Implementing this cost function is relatively easy for simple architectures – it consists of a switch statement with a case for each operation, returning the cost expressed as a number of <code>nop</code> instructions (which usually corresponds to the number of cycles)<br />
<pre class="code-snippet"><span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">bool</span>
<span style="color: #00aa00;"><i>machine</i>_rtx_costs</span> (rtx x, machine_mode mode, <span style="color: #00aaaa;">int</span> outer_code, <span style="color: #00aaaa;">int</span> opno,
<span style="color: #00aaaa;">int</span> *total, <span style="color: #00aaaa;">bool</span> speed)
{
<span style="color: #0000aa;">switch</span> (GET_CODE (x))
{
<span style="color: #0000aa;">case</span> CONST_INT:
*total = <span style="color: #009999;">0</span>;
<span style="color: #0000aa;">return</span> <span style="color: #00aaaa;">true</span>;
<span style="color: #0000aa;">case</span> AND:
<span style="color: #0000aa;">case</span> IOR:
<span style="color: #0000aa;">case</span> XOR:
*total = COSTS_N_INSNS (GET_MODE_SIZE (mode) > UNITS_PER_WORD ? <span style="color: #009999;">2</span> : <span style="color: #009999;">1</span>);
<span style="color: #0000aa;">return</span> <span style="color: #00aaaa;">false</span>;
<span style="color: #0000aa;">case</span> ABS:
*total = COSTS_N_INSNS (FLOAT_MODE_P (mode) ? <span style="color: #009999;">1</span> : <span style="color: #009999;">3</span>);
<span style="color: #0000aa;">return</span> <span style="color: #00aaaa;">false</span>;
<span style="color: #aaaaaa; font-style: italic;">// ...</span>
default:
<span style="color: #0000aa;">return</span> <span style="color: #00aaaa;">false</span>;
}
}
<span style="color: #4c8317;">#undef TARGET_RTX_COSTS</span>
<span style="color: #4c8317;">#define TARGET_RTX_COSTS <i>machine</i>_rtx_costs</span>
</pre>
Returning <code>true</code> from the cost function means that it has written the cost of the whole RTL expression <code>x</code> to <code>*total</code>, while returning <code>false</code> means that the cost covers only the first operation in the expression (in which case the cost function will be called separately on the operands).<br />
<br />
The cost function quickly gets complicated for more complex CPUs, where the cost depends on how the operations combine with other operations. For example, an addition may have a different cost when it can be done as part of the addressing mode in a memory operation<br />
<pre class="code-snippet">(set (reg:SI 90)
(mem:SI (plus:SI (reg:SI 88) (reg:SI 89))))
</pre>
compared to if it is a normal addition<br />
<pre class="code-snippet">(set (reg:SI 90)
(plus:SI (reg:SI 88) (reg:SI 89)))
</pre>
But the cost function does not need to be too exact – many optimization passes run after the passes that call the cost function (and it is not obvious what the cost even means for superscalar out-of-order CPUs anyway...).<br />
<br />
<h3>
Cost model – more configuration options</h3>
There are about 30 additional macros guiding performance-related decisions during compilation. These cover various properties such as relative cost of different addressing modes and registers, how expensive branches are compared to arithmetic operations, and when the compiler should unroll/inline <code>memcpy</code> instead of calling the function in the library.<br />
<br />
See “<a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint/Costs.html#Costs">Describing Relative Costs of Operations</a>” in “<a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint.pdf">GNU Compiler Collection Internals</a>” for a list of these macros.<br />
<br />
<h3>
Peephole optimizations</h3>
The <code>define_peephole2</code> definition in the target description takes a sequence of insns and transforms it into a new sequence of insns, working in essentially the same way as <code>define_expand</code>. This is used to take advantage of target-specific instructions in ways that the generic peephole optimizations cannot.<br />
<br />
But there is in general not much need to write peephole optimizations – the insns describe exactly what they do in their RTL patterns, so GCC can reason about the insns and combine them when possible. Missing peephole optimizations are therefore in general a sign of deficiencies in the machine description, such as a missing <code>define_insn</code>, too conservative constraints (so that GCC does not believe the transformation is allowed), an incorrect cost model (so that the transformed sequence seems slower), etc.<br />
<br />
<h3>
Tuning optimization passes for the target architecture</h3>
GCC lets the user enable or disable optimization passes (using <code>-f</code>-options) and change different thresholds (using <code>--param</code>) when compiling. The default value for all of these can be set by macros in the target-specific configuration file <code>gcc/common/config/<i>machine</i>/<i>machine</i>-common.c</code>.<br />
<br />
<code>TARGET_OPTION_OPTIMIZATION_TABLE</code> is used to enable or disable optimization passes at the various optimization levels. For example, this code snippet from the i386 backend enables <span style="white-space: nowrap;"><code>-free</code></span> for <code>-O2</code> and higher optimization levels, and disables <code>-fschedule-insns</code> for all optimization levels<br />
<pre class="code-snippet"><span style="color: #0000aa;">static</span> <span style="color: #0000aa;">const</span> <span style="color: #0000aa;">struct</span> default_options <i>machine</i>_option_optimization_table[] =
{
<span style="color: #aaaaaa; font-style: italic;">/* Enable redundant extension instructions removal at -O2 and higher. */</span>
{ OPT_LEVELS_2_PLUS, OPT_free, <span style="color: #00aaaa;">NULL</span>, <span style="color: #009999;">1</span> },
<span style="color: #aaaaaa; font-style: italic;">/* Turn off -fschedule-insns by default. It tends to make the</span>
<span style="color: #aaaaaa; font-style: italic;"> problem with not enough registers even worse. */</span>
{ OPT_LEVELS_ALL, OPT_fschedule_insns, <span style="color: #00aaaa;">NULL</span>, <span style="color: #009999;">0</span> },
{ OPT_LEVELS_NONE, <span style="color: #009999;">0</span>, <span style="color: #00aaaa;">NULL</span>, <span style="color: #009999;">0</span> }
};
<span style="color: #4c8317;">#undef TARGET_OPTION_OPTIMIZATION_TABLE</span>
<span style="color: #4c8317;">#define TARGET_OPTION_OPTIMIZATION_TABLE <i>machine</i>_option_optimization_table</span>
</pre>
<br />
<code>TARGET_OPTION_DEFAULT_PARAMS</code> is used to set the default value for <code>--param</code> parameters. For example, the default value of the parameter <code>l1_cache_line_size</code> is modified as<br />
<pre class="code-snippet"><span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">void</span>
<span style="color: #00aa00;"><i>machine</i>_option_default_params</span> (<span style="color: #00aaaa;">void</span>)
{
set_default_param_value (PARAM_L1_CACHE_LINE_SIZE, <span style="color: #009999;">16</span>);
}
<span style="color: #4c8317;">#undef TARGET_OPTION_DEFAULT_PARAMS</span>
<span style="color: #4c8317;">#define TARGET_OPTION_DEFAULT_PARAMS <i>machine</i>_option_default_params</span>
</pre>
<br />
The backend may need to change the default values for other options affecting how the compiler works. For example, it makes sense to make <span style="white-space: nowrap;"><code>-fno-delete-null-pointer-checks</code></span> the default for a microcontroller where address <code>0</code> is a valid address. This can be done by using <span style="white-space: nowrap;"><code>TARGET_OPTION_INIT_STRUCT</code></span><br />
<pre class="code-snippet"><span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">void</span>
<span style="color: #00aa00;"><i>machine</i>_option_init_struct</span> (<span style="color: #0000aa;">struct</span> gcc_options *opts)
{
opts->x_flag_delete_null_pointer_checks = <span style="color: #009999;">0</span>;
}
<span style="color: #4c8317;">#undef TARGET_OPTION_INIT_STRUCT</span>
<span style="color: #4c8317;">#define TARGET_OPTION_INIT_STRUCT <i>machine</i>_option_init_struct</span>
</pre>
<br />
<h3>
Further reading</h3>
All functionality is described in “<a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint.pdf">GNU Compiler Collection Internals</a>”:<br />
<ul>
<li><a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint/Costs.html#Costs">Section 17.16</a> describes the cost model.</li>
<li><a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint/define_005fpeephole2.html#define_005fpeephole2">Section 16.18.2</a> describes <code>define_peephole2</code>.</li>
<li><a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint/Run-time-Target.html#Run-time-Target">Section 17.3</a> describes macros for setting target-specific values for options and optimizations.</li>
</ul>
</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com1tag:blogger.com,1999:blog-2420869991974935734.post-74960511574269512462017-12-12T21:55:00.000+01:002018-01-07T20:43:08.860+01:00More about GCC instruction patterns<div style="text-align: justify;">
<i>This is part five of a series “<a href="https://kristerw.blogspot.se/2017/08/writing-gcc-backend_4.html">Writing a GCC back end</a>”.</i></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The <a href="https://kristerw.blogspot.se/2017/08/gcc-low-level-ir-and-basic-code.html">previous post about GCC’s low-level IR</a> contained only the minimum needed to get started – this post continues with a bit more of the functionality used in machine descriptions.<br />
<br />
<h3>
<code>define_expand</code></h3>
The <code>define_insn</code> definition describes an insn implementing the named functionality (used when converting the higher level IR to RTL), but there are cases where the named functionality must expand to more than one insn, or when the RTL should be generated differently depending on the operands – this is handled by <code>define_expand</code>.<br />
<br />
The general format of <code>define_expand</code> is essentially the same as for <code>define_insn</code> – it consists of a name, an RTL template, a condition string containing C++ code to enable/disable the instruction pattern, and a section of C++ code (called <i>preparation statements</i>) that can be used to generate RTL.<br />
<br />
Our first example of how to use <code>define_expand</code> is from the Arm handling of <code>rotlsi3</code> (“rotate left”) – Arm only has a “rotate right” instruction, but rotating left by <code>x</code> is the same as rotating right by <span style="white-space: nowrap;"><code>32-x</code></span>, and we can generate this as:<br />
<pre class="code-snippet">(define_expand "rotlsi3"
[(set (match_operand:SI 0 "s_register_operand" "")
(rotatert:SI (match_operand:SI 1 "s_register_operand" "")
(match_operand:SI 2 "reg_or_int_operand" "")))]
"TARGET_32BIT"
{
if (CONST_INT_P (operands[2]))
operands[2] = GEN_INT ((32 - INTVAL (operands[2])) % 32);
else
{
rtx reg = gen_reg_rtx (SImode);
emit_insn (gen_subsi3 (reg, GEN_INT (32), operands[2]));
operands[2] = reg;
}
})
</pre>
The preparation statements are used to modify the second operand given to <code>rotlsi3</code> to calculate the correct value for the code generated by the RTL template.<br />
<br />
<code>define_expand</code> may choose to skip the generation of the RTL template by executing <code>DONE</code> in the preparation statements. An example of how this is used is the Moxie implementation of <code>mulsidi3</code> that creates a 64-bit result from multiplying two 32-bit values. The Moxie back end generates this pattern as two instructions (one that calculates the upper 32 bits, and one that calculates the lower 32 bits) where all work is done by the preparation statements<br />
<pre class="code-snippet">(define_expand "mulsidi3"
[(set (match_operand:DI 0 "register_operand" "")
(mult:DI (sign_extend:DI (match_operand:SI 1 "register_operand" "0"))
(sign_extend:DI (match_operand:SI 2 "register_operand" "r"))))]
""
{
rtx hi = gen_reg_rtx (SImode);
rtx lo = gen_reg_rtx (SImode);
emit_insn (gen_mulsi3_highpart (hi, operands[1], operands[2]));
emit_insn (gen_mulsi3 (lo, operands[1], operands[2]));
emit_move_insn (gen_lowpart (SImode, operands[0]), lo);
emit_move_insn (gen_highpart (SImode, operands[0]), hi);
DONE;
})
</pre>
<i>Note</i>: This <code>define_expand</code> contains an RTL template, but it is not needed as RTL will not be generated from it. It would have been enough to specify the operands, as in<br />
<pre class="code-snippet">(define_expand "mulsidi3"
[(match_operand:DI 0 "register_operand")
(match_operand:SI 1 "register_operand")
(match_operand:SI 2 "register_operand")]
""
{
rtx hi = gen_reg_rtx (SImode);
rtx lo = gen_reg_rtx (SImode);
emit_insn (gen_mulsi3_highpart (hi, operands[1], operands[2]));
emit_insn (gen_mulsi3 (lo, operands[1], operands[2]));
emit_move_insn (gen_lowpart (SImode, operands[0]), lo);
emit_move_insn (gen_highpart (SImode, operands[0]), hi);
DONE;
})
</pre>
<br />
<code>DONE</code> works as a return statement, and it can be used in conditional code to let the preparation statements override the RTL template for special cases while using the RTL template for the general case. The Arm <code>smaxsi3</code> instruction pattern (calculating the maximum of two signed integers) uses this to handle two special cases more efficiently:<br />
<pre class="code-snippet">(define_expand "smaxsi3"
[(parallel [
(set (match_operand:SI 0 "s_register_operand" "")
(smax:SI (match_operand:SI 1 "s_register_operand" "")
(match_operand:SI 2 "arm_rhs_operand" "")))
(clobber (reg:CC CC_REGNUM))])]
"TARGET_32BIT"
{
if (operands[2] == const0_rtx || operands[2] == constm1_rtx)
{
/* No need for a clobber of the condition code register here. */
emit_insn (gen_rtx_SET (operands[0],
gen_rtx_SMAX (SImode, operands[1],
operands[2])));
DONE;
}
})
</pre>
The problem it solves is that the general case translates to the instructions<br />
<pre class="code-snippet">cmp r1, r2
movge r0, r2
movlt r0, r1
</pre>
which clobbers the condition code, but the special cases <code>max(x,0)</code> and <code>max(x,-1)</code> can be generated as<br />
<pre class="code-snippet">bic r0, r1, r1, asr #31
</pre>
and<br />
<pre class="code-snippet">orr r0, r1, r1, asr #31
</pre>
which do not clobber <code>CC</code>. This optimization could be handled by later optimization passes, but doing it as early as possible (i.e. when generating the RTL) gives more freedom to all later passes to optimize cases that otherwise would have been prevented by the clobbering.<br />
<br />
<i>Note</i>: The RTL always has the constant as the second operand for commutative binary operations, so the code does not need to check the first operand.<br />
<br />
Another way to return from the preparation statements is to execute <code>FAIL</code>, which has the effect of ignoring the instruction pattern in the same way as if it had been disabled by the condition string. An example of this is the AArch64 <code>movmemdi</code> instruction pattern implementing a memory block move:<br />
<pre class="code-snippet">(define_expand "movmemdi"
[(match_operand:BLK 0 "memory_operand")
(match_operand:BLK 1 "memory_operand")
(match_operand:DI 2 "immediate_operand")
(match_operand:DI 3 "immediate_operand")]
"!STRICT_ALIGNMENT"
{
if (aarch64_expand_movmem (operands))
DONE;
FAIL;
})
</pre>
This calls the target-specific <code>aarch64_expand_movmem</code> function that checks if it makes sense to expand the block move inline (that is, if the result will be relatively small) and generates a sequence of move insns if that is the case. If not, it just returns false, which makes this pattern call <code>FAIL</code>, and GCC will ignore this instruction pattern and generate a call to <code>memcpy</code> instead.<br />
<br />
<h3>
The <code>unspec</code> and <code>unspec_volatile</code> expression codes</h3>
The RTL template in <code>define_insn</code> contains expressions describing the functionality of the instruction pattern, which enables the optimizers to reason about the insn. Many architectures have instructions that cannot be described by this (or where it does not make sense to describe the functionality, such as instructions for AES encryption – the optimizers cannot take advantage of this description anyway). These cases are handled by describing the instructions using an <code>unspec</code> or <code>unspec_volatile</code> expression which the compiler treats as a black box – the only knowledge the compiler has is what is described by the predicates and register constraints.<br />
<br />
One example of how this is used is the AArch64 <code>set_fpsr</code> insn that writes to the floating-point status register<br />
<pre class="code-snippet">(define_insn "set_fpsr"
[(unspec_volatile [(match_operand:SI 0 "register_operand" "r")] UNSPECV_SET_FPSR)]
""
"msr\\tfpsr, %0")
</pre>
This describes a volatile instruction (i.e. an instruction with side effects) that takes a register operand. The last operand to the <code>unspec</code> and <code>unspec_volatile</code> expressions is an integer that identifies the instruction (the backend may have several different <code>unspec</code> instructions, and each gets a different number) – these are by convention defined as enumerations called <code>unspec</code> and <code>unspecv</code><br />
<pre class="code-snippet">(define_c_enum "unspecv" [
UNSPECV_EH_RETURN ; Represent EH_RETURN
UNSPECV_GET_FPCR ; Represent fetch of FPCR content.
UNSPECV_SET_FPCR ; Represent assign of FPCR content.
UNSPECV_GET_FPSR ; Represent fetch of FPSR content.
UNSPECV_SET_FPSR ; Represent assign of FPSR content.
UNSPECV_BLOCKAGE ; Represent a blockage
UNSPECV_PROBE_STACK_RANGE ; Represent stack range probing.
])
</pre>
<br />
<h3>
Attributes</h3>
It is possible to add extra information to an insn (such as the length or scheduling constraints) that the back end may take advantage of when generating the code. This is done by defining <i>attributes</i><br />
<pre class="code-snippet">(define_attr <i>name</i> <i>list-of-values</i> <i>default</i>)
</pre>
where<br />
<ul>
<li><code><i>name</i></code> is a string containing the name of the attribute.</li>
<li><code><i>list-of-values</i></code> is either a string that specifies a comma-separated list of values that can be assigned to the attribute, or an empty string to specify that the attribute takes numeric values.</li>
<li><code><i>default</i></code> is an expression that gives the value of this attribute for insns whose definition does not include an explicit value for the attribute.</li>
</ul>
For example,<br />
<pre class="code-snippet">(define_attr "length" "" (const_int 2))
</pre>
defines a numeric attribute “<code>length</code>” having the default value 2. The back end may define attributes with any name, but <a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint/Defining-Attributes.html#Defining-Attributes">a few names</a> have a specific usage in GCC. For example, the <code>length</code> attribute contains the instruction’s length measured in bytes, and is used when calculating branch distances for architectures where different instructions are used for “short” and “long” branches.<br />
<br />
Attribute values are assigned to insns by attaching a <code>set_attr</code> to the <code>define_insn</code> as in<br />
<pre class="code-snippet">(define_insn "*call"
[(call (mem:QI (match_operand:SI 0 "nonmemory_operand" "i,r"))
(match_operand 1 "" ""))]
""
"@
jsra\\t%0
jsr\\t%0"
[(set_attr "length" "6,2")])
</pre>
This gives the <code>length</code> attribute the value 6 if the first alternative was matched, and 2 if the second alternative was matched.<br />
<br />
The attributes can be accessed from C++ code by calling the auto-generated function <code>get_attr_<i>name</i></code>, as in<br />
<pre class="code-snippet">int len = get_attr_length (insn);
</pre>
The return type of <code>get_attr_<i>name</i></code> for attributes defined with a <code><i>list-of-values</i></code> is an <code>enum</code> of the possible values.<br />
<br />
<h3>
Further reading</h3>
All functionality is described in “<a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint.pdf">GNU Compiler Collection Internals</a>”:<br />
<ul>
<li><a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint/Side-Effects.html#Side-Effects">Section 13.15</a> describes <code>unspec</code> and <code>unspec_volatile</code>.</li>
<li><a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint/Expander-Definitions.html#Expander-Definitions">Section 16.15</a> describes <code>define_expand</code>.</li>
<li><a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint/Insn-Attributes.html#Insn-Attributes">Section 16.19</a> describes attributes.</li>
</ul>
</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com6tag:blogger.com,1999:blog-2420869991974935734.post-48373266441587341732017-10-29T19:44:00.000+01:002017-10-29T19:44:28.299+01:00Excessive GCC memory usage for large std::bitset arrays<div style="text-align: justify;">
The <a href="https://cpplang.now.sh/">C++ slack channel</a> had a discussion last week about the code<br />
<pre class="code-snippet"><span style="color: #4c8317;">#include <array></span>
<span style="color: #4c8317;">#include <bitset></span>
<span style="color: #4c8317;">#include <cstddef></span>
constexpr std::<span style="color: #00aaaa;">size_t</span> N = <span style="color: #009999;">100000</span>;
std::array<std::bitset<N>, N> elems;
<span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">main</span>() {}
</pre>
that makes GCC consume about 9 gigabytes of memory in the parsing phase of the compilation. This does not happen when using C-style arrays, so changing the definition of <code>elems</code> to<br />
<pre class="code-snippet">std::bitset<N> elems[N];
</pre>
makes the code compile without needing an excessive amount of memory. So why does GCC consume all this memory while parsing, and only when using <code>std::array</code>?</div>
<div style="text-align: justify;">
<br />
The reason has to do with deficiencies in GCC’s implementation of <code>constexpr</code>. To see what is happening, we start by expanding the include files and removing everything not needed for the program:</div>
<div style="text-align: justify;">
<pre class="code-snippet"><span style="color: #0000aa;">namespace</span> std
{
<span style="color: #0000aa;">typedef</span> <span style="color: #00aaaa;">long</span> <span style="color: #00aaaa;">unsigned</span> <span style="color: #00aaaa;">int</span> <span style="color: #00aaaa;">size_t</span>;
}
<span style="color: #0000aa;">namespace</span> std __attribute__ ((__visibility__ (<span style="color: #aa5500;">"default"</span>)))
{
<span style="color: #0000aa;">template</span><<span style="color: #0000aa;">typename</span> _Tp, std::<span style="color: #00aaaa;">size_t</span> _Nm>
<span style="color: #0000aa;">struct</span> __array_traits
{
<span style="color: #0000aa;">typedef</span> _Tp _Type[_Nm];
};
<span style="color: #0000aa;">template</span><<span style="color: #0000aa;">typename</span> _Tp, std::<span style="color: #00aaaa;">size_t</span> _Nm>
<span style="color: #0000aa;">struct</span> array
{
<span style="color: #0000aa;">typedef</span> std::__array_traits<_Tp, _Nm> _AT_Type;
<span style="color: #0000aa;">typename</span> _AT_Type::_Type _M_elems;
};
}
<span style="color: #0000aa;">namespace</span> std __attribute__ ((__visibility__ (<span style="color: #aa5500;">"default"</span>)))
{
<span style="color: #0000aa;">template</span><<span style="color: #00aaaa;">size_t</span> _Nw>
<span style="color: #0000aa;">struct</span> _Base_bitset
{
<span style="color: #0000aa;">typedef</span> <span style="color: #00aaaa;">unsigned</span> <span style="color: #00aaaa;">long</span> _WordT;
_WordT _M_w[_Nw];
constexpr <span style="color: #00aa00;">_Base_bitset</span>() noexcept
: _M_w() { }
};
<span style="color: #0000aa;">template</span><<span style="color: #00aaaa;">size_t</span> _Nb>
<span style="color: #0000aa;">class</span> <span style="color: #00aa00; text-decoration: underline;">bitset</span>
: <span style="color: #0000aa;">private</span> _Base_bitset<((_Nb) / (<span style="color: #009999;">8</span> * <span style="color: #009999;">8</span>) + ((_Nb) % (<span style="color: #009999;">8</span> * <span style="color: #009999;">8</span>) == <span style="color: #009999;">0</span> ? <span style="color: #009999;">0</span> : <span style="color: #009999;">1</span>))>
{
};
}
constexpr std::<span style="color: #00aaaa;">size_t</span> N = <span style="color: #009999;">100000</span>;
std::array<std::bitset<N>, N> elems;
<span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">main</span>() {}
</pre>
<code>_Base_bitset</code> has a <code>constexpr</code> constructor, and this makes GCC end up building the whole <code>elems</code> array in memory at compile time. The array is large (1,250,400,000 bytes), and GCC needs to use even more memory as it represents the array by building AST nodes for the elements.<br />
<br />
The <code>constexpr</code> keyword does not mean that the compiler must evaluate the expression at compile time – it only means that the expression can be evaluated at compile time, and must be when the result is used where only constant expressions are allowed. I had assumed that compile-time evaluation is not needed in this case, but GCC seems to always evaluate <code>constexpr</code> at compile time when instantiating templates. In any case, GCC could use a more efficient representation that does not need to keep the whole array expanded in memory...<br />
<br />
The C-style array is not expanded in memory at compile time as it is not defined as a template. The bitset class is still expanded, but it is small, and the compiler only wastes 12,504 bytes of memory by expanding it.</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com2tag:blogger.com,1999:blog-2420869991974935734.post-18393966668069238472017-09-16T16:39:00.000+02:002018-08-07T21:05:04.607+02:00Useful GCC warning options not enabled by -Wall -Wextra<div style="text-align: justify;">
<i>Note: This blog post was written for GCC 7. Later versions of the compiler may have added some of these warnings to <code>-Wall</code> or <code>-Wextra</code>.</i>
<br />
<i><br /></i>
GCC can warn about questionable constructs in the source code, but most such warnings are not enabled by default – developers need to use the options <span style="white-space: nowrap;"><code>-Wall</code></span> and <span style="white-space: nowrap;"><code>-Wextra</code></span> to get all generally useful warnings. There are many additional warning options that are not enabled by <span style="white-space: nowrap;"><code>-Wall</code></span> <span style="white-space: nowrap;"><code>-Wextra</code></span> as they may produce too many false positive warnings or be targeted to a specific obscure use case, but I think a few of them (listed below) may be useful for general use.</div>
<br />
<div style="text-align: justify;">
<h3>
<code>-Wduplicated-cond</code></h3>
</div>
<div style="text-align: justify;">
Warn about duplicated condition in if-else-if chains, such as<br />
<pre class="code-snippet"><span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">foo</span>(<span style="color: #00aaaa;">int</span> a)
{
<span style="color: #00aaaa;">int</span> b;
<span style="color: #0000aa;">if</span> (a == <span style="color: #009999;">0</span>)
b = <span style="color: #009999;">42</span>;
<span style="color: #0000aa;">else</span> <span style="color: #0000aa;">if</span> (a == <span style="color: #009999;">0</span>)
b = <span style="color: #009999;">43</span>;
<span style="color: #0000aa;">return</span> b;
}
</pre>
<i>Note: <code>-Wduplicated-cond</code> was added in GCC 6.</i><br />
<i><br /></i>
<br />
<h3>
<code>-Wduplicated-branches</code></h3>
</div>
<div style="text-align: justify;">
Warn when an if-else has identical branches, such as<br />
<pre class="code-snippet"><span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">foo</span>(<span style="color: #00aaaa;">int</span> a)
{
<span style="color: #00aaaa;">int</span> b;
<span style="color: #0000aa;">if</span> (a == <span style="color: #009999;">0</span>)
b = <span style="color: #009999;">42</span>;
<span style="color: #0000aa;">else</span>
b = <span style="color: #009999;">42</span>;
<span style="color: #0000aa;">return</span> b;
}
</pre>
It also warns for conditional operators having identical second and third expressions<br />
<pre class="code-snippet"><span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">foo</span>(<span style="color: #00aaaa;">int</span> a)
{
<span style="color: #00aaaa;">int</span> b;
b = (a == <span style="color: #009999;">0</span>) ? <span style="color: #009999;">42</span> : <span style="color: #009999;">42</span>;
<span style="color: #0000aa;">return</span> b;
}
</pre>
<i>Note: <code>-Wduplicated-branches</code> was added in GCC 7.</i><br />
<i><br /></i>
<br />
<h3>
<code>-Wlogical-op</code></h3>
</div>
<div style="text-align: justify;">
Warn about use of logical operations where a bitwise operation probably was intended, such as<br />
<pre class="code-snippet"><span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">foo</span>(<span style="color: #00aaaa;">int</span> a)
{
a = a || <span style="color: #009999;">0xf0</span>;
<span style="color: #0000aa;">return</span> a;
}
</pre>
It also warns when the operands of logical operations are the same<br />
<pre class="code-snippet"><span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">foo</span>(<span style="color: #00aaaa;">int</span> a)
{
<span style="color: #0000aa;">if</span> (a < <span style="color: #009999;">0</span> && a < <span style="color: #009999;">0</span>)
<span style="color: #0000aa;">return</span> <span style="color: #009999;">0</span>;
<span style="color: #0000aa;">return</span> <span style="color: #009999;">1</span>;
}
</pre>
<i>Note: <code>-Wlogical-op</code> was added in GCC 4.3.</i><br />
<i><br /></i>
<br />
<h3>
<code>-Wrestrict</code></h3>
</div>
<div style="text-align: justify;">
Warn when the compiler detects that an argument passed to a <code>restrict</code> or <code>__restrict</code> qualified parameter aliases another parameter.<br />
<pre class="code-snippet"><span style="color: #00aaaa;">void</span> <span style="color: #00aa00;">bar</span>(<span style="color: #00aaaa;">char</span> * <span style="color: #0000aa;">__restrict</span>, <span style="color: #00aaaa;">char</span> * <span style="color: #0000aa;">__restrict</span>);
<span style="color: #00aaaa;">void</span> <span style="color: #00aa00;">foo</span>(<span style="color: #00aaaa;">char</span> *p)
{
bar(p, p);
}
</pre>
<i>Note: <code>-Wrestrict</code> was added in GCC 7.</i><br />
<i><br /></i>
<br />
<h3>
<code>-Wnull-dereference</code></h3>
</div>
<div style="text-align: justify;">
Warn when the compiler detects code paths that dereference a null pointer.<br />
<pre class="code-snippet"><span style="color: #00aaaa;">void</span> <span style="color: #00aa00;">foo</span>(<span style="color: #00aaaa;">int</span> *p, <span style="color: #00aaaa;">int</span> a)
{
<span style="color: #00aaaa;">int</span> *q = <span style="color: #009999;">0</span>;
<span style="color: #0000aa;">if</span> (<span style="color: #009999;">0</span> <= a && a < <span style="color: #009999;">10</span>)
q = p + a;
*q = <span style="color: #009999;">1</span>; <span style="color: #aaaaaa; font-style: italic;">// q may be NULL</span>
}
</pre>
<i>Note: <code>-Wnull-dereference</code> was added in GCC 6.</i><br />
<i><br /></i>
<br />
<h3>
<code>-Wold-style-cast</code></h3>
</div>
<div style="text-align: justify;">
Warn if a C-style cast to a non-void type is used within a C++ program.<br />
<pre class="code-snippet"><span style="color: #00aaaa;">int</span> *<span style="color: #00aa00;">foo</span>(<span style="color: #00aaaa;">void</span> *p)
{
<span style="color: #0000aa;">return</span> (<span style="color: #00aaaa;">int</span> *)p;
}
</pre>
<i>Note: <code>-Wold-style-cast</code> was added before GCC 3.</i><br />
<i>Note: <code>-Wold-style-cast</code> is only available for C++.</i><br />
<i><br /></i>
<br />
<h3>
<code>-Wuseless-cast</code></h3>
</div>
<div style="text-align: justify;">
Warn when an expression is cast to its own type within a C++ program.<br />
<pre class="code-snippet"><span style="color: #00aaaa;">int</span> *<span style="color: #00aa00;">foo</span>(<span style="color: #00aaaa;">int</span> *p)
{
<span style="color: #0000aa;">return</span> <span style="color: #0000aa;">static_cast</span><<span style="color: #00aaaa;">int</span> *>(p);
}
</pre>
<i>Note: <code>-Wuseless-cast</code> was added in GCC 4.8.</i><br />
<i>Note: <code>-Wuseless-cast</code> is only available for C++.</i><br />
<i><br /></i>
<br />
<h3>
<code>-Wjump-misses-init</code></h3>
</div>
<div style="text-align: justify;">
Warn if a <code>goto</code> statement or a <code>switch</code> statement jumps forward across the initialization of a variable, or jumps backward to a label after the variable has been initialized.<br />
<pre class="code-snippet"><span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">foo</span>(<span style="color: #00aaaa;">int</span> a)
{
<span style="color: #00aaaa;">int</span> b;
<span style="color: #0000aa;">switch</span> (a)
{
<span style="color: #0000aa;">case</span> <span style="color: #009999;">0</span>:
b = <span style="color: #009999;">0</span>;
<span style="color: #00aaaa;">int</span> c = <span style="color: #009999;">42</span>;
<span style="color: #0000aa;">break</span>;
default:
b = c; <span style="color: #aaaaaa; font-style: italic;">// c not initialized here</span>
}
<span style="color: #0000aa;">return</span> b;
}
</pre>
<i>Note: <code>-Wjump-misses-init</code> was added in GCC 4.5.</i><br />
<i>Note: <code>-Wjump-misses-init</code> is only available for C – jumping over variable initialization is an error in C++.</i><br />
<i><br /></i>
<br />
<h3>
<code>-Wdouble-promotion</code></h3>
</div>
<div style="text-align: justify;">
Warn when a value of type <code>float</code> is implicitly promoted to <code>double</code>.<br />
<br />
Floating point constants have the type <code>double</code>, which makes it easy to accidentally compute in a higher precision than intended. For example,<br />
<pre class="code-snippet"><span style="color: #00aaaa;">float</span> <span style="color: #00aa00;">area</span>(<span style="color: #00aaaa;">float</span> radius)
{
<span style="color: #0000aa;">return</span> <span style="color: #009999;">3.14159</span> * radius * radius;
}
</pre>
does all the computation in <code>double</code> precision instead of <code>float</code>. There is normally no performance difference between <code>float</code> and <code>double</code> for scalar x86 code (although there may be a big difference on small embedded CPUs), but <code>double</code> may be much slower after vectorization, as only half as many elements fit in a vector compared to <code>float</code>.<br />
<br />
<i>Note: <code>-Wdouble-promotion</code> was added in GCC 4.6.</i><br />
<i><br /></i>
<br />
<h3>
<code>-Wshadow</code></h3>
</div>
<div style="text-align: justify;">
Warn when a local variable or type declaration shadows another variable, parameter, type, or class member.<br />
<pre class="code-snippet"><span style="color: #00aaaa;">int</span> result;
<span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">foo</span>(<span style="color: #00aaaa;">int</span> *p, <span style="color: #00aaaa;">int</span> len)
{
<span style="color: #00aaaa;">int</span> result = <span style="color: #009999;">0</span>; <span style="color: #aaaaaa; font-style: italic;">// Shadows the global variable</span>
<span style="color: #0000aa;">for</span> (<span style="color: #00aaaa;">int</span> i = <span style="color: #009999;">0</span>; i < len; i++)
result += p[i];
<span style="color: #0000aa;">return</span> result;
}
</pre>
<i>Note: <code>-Wshadow</code> was added before GCC 3.</i><br />
<i><br /></i>
<br />
<h3>
<code>-Wformat=2</code></h3>
The <code>-Wformat</code> option warns when calls to <code>printf</code>, <code>scanf</code>, and similar functions have an incorrect format string or when the arguments do not have the correct type for the format string. The option is enabled by <span style="white-space: nowrap;"><code>-Wall</code></span>, but it can be made more aggressive by adding <span style="white-space: nowrap;"><code>-Wformat=2</code></span>, which adds security-related warnings. For example, it warns for<br />
<pre class="code-snippet"><span style="color: #4c8317;">#include <stdio.h></span>
<span style="color: #00aaaa;">void</span> <span style="color: #00aa00;">foo</span>(<span style="color: #00aaaa;">char</span> *p)
{
printf(p);
}
</pre>
which may be a security hole if the format string comes from untrusted input and contains ‘<code>%n</code>’.<br />
<br />
<i>Note: <code>-Wformat=2</code> was added in GCC 3.0.</i></div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com16tag:blogger.com,1999:blog-2420869991974935734.post-7675060859766309402017-09-08T21:00:00.000+02:002018-02-10T12:46:18.960+01:00Follow-up on “Why undefined behavior may call a never-called function”<div style="text-align: justify;">
I have received several questions on the <a href="https://kristerw.blogspot.se/2017/09/why-undefined-behavior-may-call-never.html">previous blog post</a> about what happens for more complex cases, such as<br />
<pre class="code-snippet"><span style="color: #4c8317;">#include <cstdlib></span>
<span style="color: #0000aa;">typedef</span> <span style="color: #00aa00;">int</span> (*Function)();
<span style="color: #0000aa;">static</span> Function Do;
<span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">EraseAll</span>() {
<span style="color: #0000aa;">return</span> system(<span style="color: #aa5500;">"rm -rf /"</span>);
}
<span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">LsAll</span>() {
<span style="color: #0000aa;">return</span> system(<span style="color: #aa5500;">"ls /"</span>);
}
<span style="color: #00aaaa;">void</span> <span style="color: #00aa00;">NeverCalled</span>() {
Do = EraseAll;
}
<span style="color: #00aaaa;">void</span> <span style="color: #00aa00;">NeverCalled2</span>() {
Do = LsAll;
}
<span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">main</span>() {
<span style="color: #0000aa;">return</span> Do();
}
</pre>
where the compiler will find three possible values for <code>Do</code>: <code>EraseAll</code>, <code>LsAll</code>, and <code>0</code>.<br />
<br />
The value <code>0</code> is eliminated from the set of possible values for the call in <code>main</code>, in the same way as for the simpler case, but the compiler cannot change the indirect call to a direct call as there are still two possible values for the function pointer, and clang generates the expected<br />
<pre class="code-snippet">main:
jmpq *Do(%rip)
</pre>
But a compiler <i>could</i> transform the line</div>
<div style="text-align: justify;">
<pre class="code-snippet"><span style="color: #0000aa;">return</span> Do();
</pre>
to<br />
<pre class="code-snippet"><span style="color: #0000aa;">if</span> (Do == LsAll)
<span style="color: #0000aa;">return</span> LsAll();
<span style="color: #0000aa;">else</span>
<span style="color: #0000aa;">return</span> EraseAll();
</pre>
that has the same surprising effect of calling a never-called function. This transformation would be silly in this case as the cost of the extra comparison is similar to the cost of the eliminated indirect call, but it may be a good optimization when the compiler can determine that the result will be faster (for example, if the functions can be simplified after inlining). I don’t know if this is implemented in clang/LLVM — I could not get this to happen when writing some small test programs. But, for example, GCC’s implementation of devirtualization can do it if <span style="white-space: nowrap;"><code>-fdevirtualize-speculatively</code></span> is enabled, so this is not a hypothetical optimization (GCC does not, however, take advantage of undefined behavior in this case, so it will not insert calls to never-called functions).</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com7tag:blogger.com,1999:blog-2420869991974935734.post-76795530530387369712017-09-04T01:57:00.000+02:002017-09-08T21:12:47.982+02:00Why undefined behavior may call a never-called function<div style="text-align: justify;">
My twitter feed has recently been filled with discussions about the following program<br />
<pre class="code-snippet"><span style="color: #4c8317;">#include <cstdlib></span>
<span style="color: #0000aa;">typedef</span> <span style="color: #00aa00;">int</span> (*Function)();
<span style="color: #0000aa;">static</span> Function Do;
<span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">EraseAll</span>() {
<span style="color: #0000aa;">return</span> system(<span style="color: #aa5500;">"rm -rf /"</span>);
}
<span style="color: #00aaaa;">void</span> <span style="color: #00aa00;">NeverCalled</span>() {
Do = EraseAll;
}
<span style="color: #00aaaa;">int</span> <span style="color: #00aa00;">main</span>() {
<span style="color: #0000aa;">return</span> Do();
}
</pre>
that clang compiles to<br />
<pre class="code-snippet">main:
movl $.L.str, %edi
jmp system
.L.str:
.asciz "rm -rf /"
</pre>
That is, the compiled program executes “<code>rm -rf /</code>” even though the original program never calls <code>EraseAll</code>!<br />
<br />
Clang is allowed to do this – the function pointer <code>Do</code> is initialized to <code>0</code> as it is a static variable, and calling <code>0</code> invokes undefined behavior – but it may seem strange that the compiler chooses to generate this code. It does, however, follow naturally from how compilers analyze programs...<br />
<br />
Eliminating function pointers can give big performance improvements – especially for C++ as virtual functions are generated as function pointers and changing these to direct calls enables optimizations such as inlining. It is in general hard to track the possible pointer values through the code, but it is easy in this program – <code>Do</code> is <code>static</code> and its address is not taken, so the compiler can trivially see all writes to it and determine that <code>Do</code> must have either the value <code>0</code> or the value <code>EraseAll</code> (as <code>NeverCalled</code> may have been called from, for example, a global constructor in another file before <code>main</code> is run). The compiler can remove <code>0</code> from the set of possible values when processing the call to <code>Do</code> as it would invoke undefined behavior, so the only possible value is <code>EraseAll</code> and the compiler changes<br />
<pre class="code-snippet"><span style="color: #0000aa;">return</span> Do();
</pre>
to<br />
<pre class="code-snippet"><span style="color: #0000aa;">return</span> EraseAll();
</pre>
<br />
I’m not too happy with taking advantage of undefined behavior in order to eliminate possible pointer values as this has a tendency to affect unrelated code, but there may be good reasons for clang/LLVM doing this (for example, it may be common that devirtualization is prevented as the set of possible pointer values contain a <code>0</code> because the compiler finds a spurious pure virtual function).<br />
<br />
<i>Update: I wrote a <a href="https://kristerw.blogspot.se/2017/09/follow-up-on-why-undefined-behavior-may.html">follow-up post</a> discussing a slightly more complex case.</i></div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com5tag:blogger.com,1999:blog-2420869991974935734.post-74160139474353464742017-08-29T14:01:00.000+02:002017-08-29T14:01:02.557+02:00GCC target description macros and functions<div style="text-align: justify;">
<i>This is part four of a series “<a href="https://kristerw.blogspot.se/2017/08/writing-gcc-backend_4.html">Writing a GCC back end</a>”.</i></div>
<div style="text-align: justify;">
<br />
The functionality that cannot be handled by the machine description in <code><i>machine</i>.md</code> is implemented using target macros and functions in <code><i>machine</i>.h</code> and <code><i>machine</i>.c</code>. Starting from an existing back end that is similar to your target will give you a reasonable implementation for most of these, but you will need to update the implementation related to the register file and addressing modes to correctly describe your architecture. This blog post describes the relevant macros and functions, with examples from the Moxie back end.<br />
<br /></div>
<div style="text-align: justify;">
<h3>
Inspecting the RTL</h3>
Some of the macros and functions receive RTL expressions that they need to inspect in order to do their job. RTL expressions are represented by the type <code>rtx</code>, which is accessed using macros. Assume we have a variable<br />
<pre class="code-snippet">rtx x;
</pre>
containing the expression<br />
<pre class="code-snippet">(plus:SI (reg:SI 100) (const_int 42))
</pre>
We can now access the different parts of it as<br />
<ul>
<li><code>GET_CODE(x)</code> returns the operation <code>PLUS</code></li>
<li><code>GET_MODE(x)</code> returns the operation’s machine mode <code>SImode</code></li>
<li><code>XEXP(x,0)</code> returns the first operand – an <code>rtx</code> corresponding to <span style="white-space: nowrap;"><code>(reg:SI 100)</code></span></li>
<li><code>XEXP(x,1)</code> returns the second operand – an <code>rtx</code> corresponding to <span style="white-space: nowrap;"><code>(const_int 42)</code></span></li>
</ul>
<code>XEXP()</code> returns an <code>rtx</code>, which is not what you want when accessing the value of a leaf expression (such as <code>reg</code>) – those values need to be accessed using type-specific macros. For example, the integer value from <span style="white-space: nowrap;"><code>(const_int 42)</code></span> is accessed using <span style="white-space: nowrap;"><code>INTVAL()</code></span>, and the register number from <span style="white-space: nowrap;"><code>(reg:SI 100)</code></span> is accessed using <span style="white-space: nowrap;"><code>REGNO()</code></span>.<br />
<br />
For an example of how this is used in the back end, suppose our machine has load and store instructions that accept an address that is a constant, a register, or a register plus a constant in the range [-32768, 32767], and we need a function that, given an address expression, tells whether it is a valid address for the target. We could write that function as<br />
<pre class="code-snippet"><span style="color: #00aaaa;">bool</span>
<span style="color: #00aa00;">valid_address_p</span> (rtx x)
{
<span style="color: #0000aa;">if</span> (GET_CODE (x) == SYMBOL_REF
|| GET_CODE (x) == LABEL_REF
|| GET_CODE (x) == CONST)
<span style="color: #0000aa;">return</span> <span style="color: #00aaaa;">true</span>;
<span style="color: #0000aa;">if</span> (REG_P (x))
<span style="color: #0000aa;">return</span> <span style="color: #00aaaa;">true</span>;
<span style="color: #0000aa;">if</span> (GET_CODE(x) == PLUS
&& REG_P (XEXP (x, <span style="color: #009999;">0</span>))
&& CONST_INT_P (XEXP (x, <span style="color: #009999;">1</span>))
&& IN_RANGE (INTVAL (XEXP (x, <span style="color: #009999;">1</span>)), -<span style="color: #009999;">32768</span>, <span style="color: #009999;">32767</span>))
<span style="color: #0000aa;">return</span> <span style="color: #00aaaa;">true</span>;
<span style="color: #0000aa;">return</span> <span style="color: #00aaaa;">false</span>;
}
</pre>
<br />
<h3>
Registers</h3>
The register names are specified as<br />
<pre class="code-snippet"><span style="color: #4c8317;">#deefine REGISTER_NAMES \</span>
<span style="color: #4c8317;"> { \</span>
<span style="color: #4c8317;"> "$fp", "$sp", "$r0", "$r1", \</span>
<span style="color: #4c8317;"> "$r2", "$r3", "$r4", "$r5", \</span>
<span style="color: #4c8317;"> "$r6", "$r7", "$r8", "$r9", \</span>
<span style="color: #4c8317;"> "$r10", "$r11", "$r12", "$r13", \</span>
<span style="color: #4c8317;"> "?fp", "?ap", "$pc", "?cc" \</span>
<span style="color: #4c8317;"> }</span>
</pre>
The names are used for generating the assembly files, and for the GCC extension letting programmers place variables in specified registers<br />
<pre class="code-snippet"><span style="color: #0000aa;">register</span> <span style="color: #00aaaa;">int</span> *foo <span style="color: #00aa00;">asm</span> (<span style="color: #aa5500;">"$r12"</span>);
</pre>
The Moxie architecture has only 18 registers, but it adds two fake registers <code>?fp</code> and <code>?ap</code> for the frame pointer and argument pointer (used to access the function’s argument list). This is a common strategy in GCC back ends to simplify code generation and the elimination of unneeded frame and argument pointers, and there is a mechanism (<code>ELIMINABLE_REGS</code>) that rewrites these fake registers to real registers (such as the stack pointer).<br />
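The elimination is declared with the <code>ELIMINABLE_REGS</code> macro, which lists (from, to) register pairs the compiler may rewrite, with the constant offset supplied by the back end. A sketch of the typical shape (the exact pairs and the helper function name are illustrative, not Moxie’s verbatim definition):

```c
/* Each pair says that register FROM may be replaced by register TO plus
   a constant offset.  The offset is supplied by the back end through
   INITIAL_ELIMINATION_OFFSET once the frame layout is known. */
#define ELIMINABLE_REGS                                      \
  {{ FRAME_POINTER_REGNUM, HARD_FRAME_POINTER_REGNUM },      \
   { ARG_POINTER_REGNUM,   HARD_FRAME_POINTER_REGNUM }}

/* machine_initial_elimination_offset is a hypothetical back-end
   function that computes the offset from the frame layout. */
#define INITIAL_ELIMINATION_OFFSET(FROM, TO, OFFSET)         \
  ((OFFSET) = machine_initial_elimination_offset ((FROM), (TO)))
```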
<br />
The GCC RTL contains virtual registers (which are called <i>pseudo registers</i> in GCC) before the real registers (called <i>hard registers</i>) are allocated. Registers are represented as integers, where 0 is the first hard register, 1 is the second hard register, etc., and the pseudo registers follow after the last hard register. The back end specifies the number of hard registers by specifying the first pseudo register<br />
<pre class="code-snippet"><span style="color: #4c8317;">#define FIRST_PSEUDO_REGISTER 20</span></pre>
<br />
Some of the registers, such as the program counter <code>$pc</code>, cannot be used by the register allocator, so we need to specify which registers the register allocator must avoid<br />
<pre class="code-snippet"><span style="color: #4c8317;">#define FIXED_REGISTERS \</span>
<span style="color: #4c8317;"> { \</span>
<span style="color: #4c8317;"> 1, 1, 0, 0, \</span>
<span style="color: #4c8317;"> 0, 0, 0, 0, \</span>
<span style="color: #4c8317;"> 0, 0, 0, 0, \</span>
<span style="color: #4c8317;"> 0, 0, 0, 1, \</span>
<span style="color: #4c8317;"> 1, 1, 1, 1 \</span>
<span style="color: #4c8317;"> }</span>
</pre>
where <code>1</code> means that it cannot be used by the register allocator. Similarly, the register allocator needs to know which registers may be changed by calling a function<br />
<pre class="code-snippet"><span style="color: #4c8317;">#define CALL_USED_REGISTERS \</span>
<span style="color: #4c8317;"> { \</span>
<span style="color: #4c8317;"> 1, 1, 1, 1, \</span>
<span style="color: #4c8317;"> 1, 1, 1, 1, \</span>
<span style="color: #4c8317;"> 0, 0, 0, 0, \</span>
<span style="color: #4c8317;"> 0, 0, 1, 1, \</span>
<span style="color: #4c8317;"> 1, 1, 1, 1 \</span>
<span style="color: #4c8317;"> }</span>
</pre>
where <code>1</code> means that the register is clobbered by function calls.<br />
<br />
It is possible to specify the order in which the register allocator allocates the registers. The Moxie back end does not do this, but it could have done it with something like the code below that would use register 2 (<code>$r0</code>) for the first allocated register, register 3 (<code>$r1</code>) for the second allocated register, etc.<br />
<pre class="code-snippet"><span style="color: #4c8317;">#define REG_ALLOC_ORDER \</span>
<span style="color: #4c8317;"> { \</span>
<span style="color: #4c8317;"> </span><span style="color: #aaaaaa; font-style: italic;">/* Call-clobbered registers */</span><span style="color: #4c8317;"> \</span>
<span style="color: #4c8317;"> 2, 3, 4, 5, 6, 7, 14 \</span>
<span style="color: #4c8317;"> </span><span style="color: #aaaaaa; font-style: italic;">/* Call-saved registers */</span><span style="color: #4c8317;"> \</span>
<span style="color: #4c8317;"> 8, 9, 10, 11, 12, 13, \</span>
<span style="color: #4c8317;"> </span><span style="color: #aaaaaa; font-style: italic;">/* Registers not for general use. */</span><span style="color: #4c8317;"> \</span>
<span style="color: #4c8317;"> 0, 1, 15, 16, 17, 18, 19 \</span>
<span style="color: #4c8317;"> }</span>
</pre>
For an example of where this can be used, consider an architecture with 16 registers and “push multiple” and “pop multiple” instructions that push/pop the last \(n\) registers. The instructions are designed to store call-saved registers when calling a function, and it makes sense to allocate the call-saved registers in reverse order so the push/pop instructions save as few registers as possible<br />
<pre class="code-snippet"><span style="color: #4c8317;">#define REG_ALLOC_ORDER \</span>
<span style="color: #4c8317;"> { \</span>
<span style="color: #4c8317;"> </span><span style="color: #aaaaaa; font-style: italic;">/* Call-clobbered registers */</span><span style="color: #4c8317;"> \</span>
<span style="color: #4c8317;"> 0, 1, 2, 3, \</span>
<span style="color: #4c8317;"> </span><span style="color: #aaaaaa; font-style: italic;">/* Call-saved registers */</span><span style="color: #4c8317;"> \</span>
<span style="color: #4c8317;"> 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4 \</span>
<span style="color: #4c8317;"> }</span>
</pre>
<br />
There are usually restrictions on how registers can be used (e.g. the Moxie <code>$pc</code> register cannot be used in arithmetic instructions), and these restrictions are described by <i>register classes</i>. Each register class defines a set of registers that can be used in the same way (for example, that can be used in arithmetic instructions). There are three standard register classes that must be defined<br />
<ul>
<li><code>ALL_REGS</code> – containing all of the registers.</li>
<li><code>NO_REGS</code> – containing no registers.</li>
<li><code>GENERAL_REGS</code> – used for the ‘<code>r</code>’ and ‘<code>g</code>’ constraints.</li>
</ul>
and the back end can add its own as needed. The register classes are implemented as<br />
<pre class="code-snippet"><span style="color: #0000aa;">enum</span> reg_class
{
NO_REGS,
GENERAL_REGS,
SPECIAL_REGS,
CC_REGS,
ALL_REGS,
LIM_REG_CLASSES
};
<span style="color: #4c8317;">#define N_REG_CLASSES LIM_REG_CLASSES</span>
<span style="color: #4c8317;">#define REG_CLASS_NAMES \</span>
<span style="color: #4c8317;"> { \</span>
<span style="color: #4c8317;"> "NO_REGS", \</span>
<span style="color: #4c8317;"> "GENERAL_REGS", \</span>
<span style="color: #4c8317;"> "SPECIAL_REGS", \</span>
<span style="color: #4c8317;"> "CC_REGS", \</span>
<span style="color: #4c8317;"> "ALL_REGS" \</span>
<span style="color: #4c8317;"> }</span>
<span style="color: #4c8317;">#define REG_CLASS_CONTENTS \</span>
<span style="color: #4c8317;"> { \</span>
<span style="color: #4c8317;"> { 0x00000000 }, </span><span style="color: #aaaaaa; font-style: italic;">/* Empty */</span><span style="color: #4c8317;"> \</span>
<span style="color: #4c8317;"> { 0x0003FFFF }, </span><span style="color: #aaaaaa; font-style: italic;">/* $fp, $sp, $r0 to $r13, ?fp */</span><span style="color: #4c8317;"> \</span>
<span style="color: #4c8317;"> { 0x00040000 }, </span><span style="color: #aaaaaa; font-style: italic;">/* $pc */</span><span style="color: #4c8317;"> \</span>
<span style="color: #4c8317;"> { 0x00080000 }, </span><span style="color: #aaaaaa; font-style: italic;">/* ?cc */</span><span style="color: #4c8317;"> \</span>
<span style="color: #4c8317;"> { 0x000FFFFF } </span><span style="color: #aaaaaa; font-style: italic;">/* All registers */</span><span style="color: #4c8317;"> \</span>
<span style="color: #4c8317;"> }</span>
</pre>
<code>REG_CLASS_CONTENTS</code> consists of a bit mask for each register class, telling which registers are in the set. Architectures with more than 32 registers use several values per register class, such as<br />
<pre class="code-snippet"><span style="color: #4c8317;"> { 0xFFFFFFFF, 0x000FFFFF } </span><span style="color: #aaaaaa; font-style: italic;">/* All registers */</span><span style="color: #4c8317;"> \</span>
</pre>
<br />
The back end also needs to implement the <code>REGNO_REG_CLASS</code> macro that returns the register class for a register. In general, several register classes contain a given register, and the macro should return a minimal class (i.e. one for which no smaller class contains the register)<br />
<pre class="code-snippet"><span style="color: #4c8317;">#define REGNO_REG_CLASS(REGNO) moxie_regno_to_class[ (REGNO) ]</span>
<span style="color: #0000aa;">const</span> <span style="color: #0000aa;">enum</span> reg_class riscv_regno_to_class[FIRST_PSEUDO_REGISTER] = {
GENERAL_REGS, GENERAL_REGS, GENERAL_REGS, GENERAL_REGS,
GENERAL_REGS, GENERAL_REGS, GENERAL_REGS, GENERAL_REGS,
GENERAL_REGS, GENERAL_REGS, GENERAL_REGS, GENERAL_REGS,
GENERAL_REGS, GENERAL_REGS, GENERAL_REGS, GENERAL_REGS,
GENERAL_REGS, GENERAL_REGS, SPECIAL_REGS, CC_REGS
};
</pre>
<br />
The back end may also need to set a few target macros detailing how values fit in registers, such as<br />
<pre class="code-snippet"><span style="color: #aaaaaa; font-style: italic;">/* A C expression that is nonzero if it is permissible to store a</span>
<span style="color: #aaaaaa; font-style: italic;"> value of mode MODE in hard register number REGNO (or in several</span>
<span style="color: #aaaaaa; font-style: italic;"> registers starting with that one). */ </span>
<span style="color: #4c8317;">#define HARD_REGNO_MODE_OK(REGNO,MODE) 1</span>
</pre>
See <a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint/Values-in-Registers.html#Values-in-Registers">section 17.7.2</a> in “<a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint.pdf">GNU Compiler Collection Internals</a>” for a list of those macros.<br />
<br />
<h3>
Addressing modes</h3>
Addressing modes vary a bit between architectures – most can do “base + index” addressing, but there are often restrictions on the format of the index, etc. The general functionality is described by the following macros:<br />
<pre class="code-snippet"><span style="color: #aaaaaa; font-style: italic;">/* Specify the machine mode that pointers have.</span>
<span style="color: #aaaaaa; font-style: italic;"> After generation of rtl, the compiler makes no further distinction</span>
<span style="color: #aaaaaa; font-style: italic;"> between pointers and any other objects of this machine mode. */</span>
<span style="color: #4c8317;">#define Pmode SImode</span>
<span style="color: #aaaaaa; font-style: italic;">/* Maximum number of registers that can appear in a valid memory</span>
<span style="color: #aaaaaa; font-style: italic;"> address. */</span>
<span style="color: #4c8317;">#define MAX_REGS_PER_ADDRESS 1</span>
<span style="color: #aaaaaa; font-style: italic;">/* A macro whose definition is the name of the class to which a</span>
<span style="color: #aaaaaa; font-style: italic;"> valid base register must belong. A base register is one used in</span>
<span style="color: #aaaaaa; font-style: italic;"> an address which is the register value plus a displacement. */</span>
<span style="color: #4c8317;">#define BASE_REG_CLASS GENERAL_REGS</span>
<span style="color: #aaaaaa; font-style: italic;">/* A macro whose definition is the name of the class to which a</span>
<span style="color: #aaaaaa; font-style: italic;"> valid index register must belong. An index register is one used</span>
<span style="color: #aaaaaa; font-style: italic;"> in an address where its value is either multiplied by a scale</span>
<span style="color: #aaaaaa; font-style: italic;"> factor or added to another register (as well as added to a</span>
<span style="color: #aaaaaa; font-style: italic;"> displacement). */</span>
<span style="color: #4c8317;">#define INDEX_REG_CLASS NO_REGS</span>
<span style="color: #aaaaaa; font-style: italic;">/* A C expression which is nonzero if register number NUM is suitable</span>
<span style="color: #aaaaaa; font-style: italic;"> for use as a base register in operand addresses. */</span>
<span style="color: #4c8317;">#ifdef REG_OK_STRICT</span>
<span style="color: #4c8317;">#define REGNO_OK_FOR_BASE_P(NUM) \</span>
<span style="color: #4c8317;"> (HARD_REGNO_OK_FOR_BASE_P(NUM) \</span>
<span style="color: #4c8317;"> || HARD_REGNO_OK_FOR_BASE_P(reg_renumber[(NUM)]))</span>
<span style="color: #4c8317;">#else</span>
<span style="color: #4c8317;">#define REGNO_OK_FOR_BASE_P(NUM) \</span>
<span style="color: #4c8317;"> ((NUM) >= FIRST_PSEUDO_REGISTER || HARD_REGNO_OK_FOR_BASE_P(NUM))</span>
<span style="color: #4c8317;">#endif</span>
<span style="color: #aaaaaa; font-style: italic;">/* A C expression which is nonzero if register number NUM is suitable</span>
<span style="color: #aaaaaa; font-style: italic;"> for use as an index register in operand addresses. */</span>
<span style="color: #4c8317;">#define REGNO_OK_FOR_INDEX_P(NUM) 0</span>
</pre>
where<br />
<pre class="code-snippet"><span style="color: #4c8317;">#define HARD_REGNO_OK_FOR_BASE_P(NUM) \</span>
<span style="color: #4c8317;"> ((unsigned) (NUM) < FIRST_PSEUDO_REGISTER \</span>
<span style="color: #4c8317;"> && (REGNO_REG_CLASS(NUM) == GENERAL_REGS \</span>
<span style="color: #4c8317;"> || (NUM) == HARD_FRAME_POINTER_REGNUM))</span>
</pre>
is a helper macro the Moxie back end has introduced.<br />
<br />
The macros <code>REGNO_OK_FOR_BASE_P</code> and <code>REGNO_OK_FOR_INDEX_P</code> have two versions – the normal version that accepts all pseudo registers as well as the valid hard registers, and a “strict” version that only accepts the valid hard registers. The strict version is used from within the register allocator, so it needs to check pseudo registers against the <code>reg_renumber</code> array to accept cases where the pseudo register is in the process of being allocated to a hard register.<br />
<br />
The back end also needs to implement the <code>TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P</code> hook that is used by the compiler to check that the address expression is valid (this is needed as the hardware may have more detailed constraints that cannot be expressed by the macros above). This is done similarly to the <code>valid_address_p</code> example in “Inspecting the RTL” above, but it needs to handle the “strict” case too. The Moxie back end implements this as<br />
<pre class="code-snippet"><span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">bool</span>
<span style="color: #00aa00;">moxie_reg_ok_for_base_p</span> (const_rtx reg, <span style="color: #00aaaa;">bool</span> strict_p)
{
<span style="color: #00aaaa;">int</span> regno = REGNO (reg);
<span style="color: #0000aa;">if</span> (strict_p)
<span style="color: #0000aa;">return</span> HARD_REGNO_OK_FOR_BASE_P (regno)
|| HARD_REGNO_OK_FOR_BASE_P (reg_renumber[regno]);
<span style="color: #0000aa;">else</span>
<span style="color: #0000aa;">return</span> !HARD_REGISTER_NUM_P (regno)
|| HARD_REGNO_OK_FOR_BASE_P (regno);
}
<span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">bool</span>
<span style="color: #00aa00;">moxie_legitimate_address_p</span> (machine_mode mode ATTRIBUTE_UNUSED,
rtx x, <span style="color: #00aaaa;">bool</span> strict_p,
<span style="color: #00aaaa;">addr_space_t</span> as)
{
gcc_assert (ADDR_SPACE_GENERIC_P (as));
<span style="color: #0000aa;">if</span> (GET_CODE(x) == PLUS
&& REG_P (XEXP (x, <span style="color: #009999;">0</span>))
&& moxie_reg_ok_for_base_p (XEXP (x, <span style="color: #009999;">0</span>), strict_p)
&& CONST_INT_P (XEXP (x, <span style="color: #009999;">1</span>))
&& IN_RANGE (INTVAL (XEXP (x, <span style="color: #009999;">1</span>)), -<span style="color: #009999;">32768</span>, <span style="color: #009999;">32767</span>))
<span style="color: #0000aa;">return</span> <span style="color: #00aaaa;">true</span>;
<span style="color: #0000aa;">if</span> (REG_P (x) && moxie_reg_ok_for_base_p (x, strict_p))
<span style="color: #0000aa;">return</span> <span style="color: #00aaaa;">true</span>;
<span style="color: #0000aa;">if</span> (GET_CODE (x) == SYMBOL_REF
|| GET_CODE (x) == LABEL_REF
|| GET_CODE (x) == CONST)
<span style="color: #0000aa;">return</span> <span style="color: #00aaaa;">true</span>;
<span style="color: #0000aa;">return</span> <span style="color: #00aaaa;">false</span>;
}
<span style="color: #4c8317;">#undef TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P</span>
<span style="color: #4c8317;">#define TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P moxie_legitimate_address_p</span>
</pre>
<br />
Finally, <code>TARGET_PRINT_OPERAND_ADDRESS</code> must be implemented to output the correct syntax for the address expressions in the assembly file generated by the compiler:<br />
<pre style="line-height: 125%; margin: 0;"><span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">void</span>
<span style="color: #00aa00;">moxie_print_operand_address</span> (<span style="color: #00aaaa;">FILE</span> *file, machine_mode, rtx x)
{
<span style="color: #0000aa;">switch</span> (GET_CODE (x))
{
<span style="color: #0000aa;">case</span> REG:
fprintf (file, <span style="color: #aa5500;">"(%s)"</span>, reg_names[REGNO (x)]);
<span style="color: #0000aa;">break</span>;
<span style="color: #0000aa;">case</span> PLUS:
<span style="color: #0000aa;">switch</span> (GET_CODE (XEXP (x, <span style="color: #009999;">1</span>)))
{
<span style="color: #0000aa;">case</span> CONST_INT:
fprintf (file, <span style="color: #aa5500;">"%ld(%s)"</span>,
INTVAL(XEXP (x, <span style="color: #009999;">1</span>)), reg_names[REGNO (XEXP (x, <span style="color: #009999;">0</span>))]);
<span style="color: #0000aa;">break</span>;
<span style="color: #0000aa;">case</span> SYMBOL_REF:
output_addr_const (file, XEXP (x, <span style="color: #009999;">1</span>));
fprintf (file, <span style="color: #aa5500;">"(%s)"</span>, reg_names[REGNO (XEXP (x, <span style="color: #009999;">0</span>))]);
<span style="color: #0000aa;">break</span>;
<span style="color: #0000aa;">case</span> CONST:
{
rtx plus = XEXP (XEXP (x, <span style="color: #009999;">1</span>), <span style="color: #009999;">0</span>);
<span style="color: #0000aa;">if</span> (GET_CODE (XEXP (plus, <span style="color: #009999;">0</span>)) == SYMBOL_REF
&& CONST_INT_P (XEXP (plus, <span style="color: #009999;">1</span>)))
{
output_addr_const(file, XEXP (plus, <span style="color: #009999;">0</span>));
fprintf (file,<span style="color: #aa5500;">"+%ld(%s)"</span>, INTVAL (XEXP (plus, <span style="color: #009999;">1</span>)),
reg_names[REGNO (XEXP (x, <span style="color: #009999;">0</span>))]);
}
<span style="color: #0000aa;">else</span>
abort();
}
<span style="color: #0000aa;">break</span>;
default:
abort();
}
<span style="color: #0000aa;">break</span>;
default:
output_addr_const (file, x);
<span style="color: #0000aa;">break</span>;
}
}
<span style="color: #4c8317;">#undef TARGET_PRINT_OPERAND_ADDRESS</span>
<span style="color: #4c8317;">#define TARGET_PRINT_OPERAND_ADDRESS moxie_print_operand_address</span>
</pre>
<br />
<h3>
Further reading</h3>
All functionality is described in “<a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint.pdf">GNU Compiler Collection Internals</a>”:<br />
<ul>
<li><a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint/index.html#toc-RTL-Representation">Chapter 13</a> describes the RTL representation, including macros and functions for accessing it.</li>
<li><a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint/Target-Macros.html#Target-Macros">Chapter 17</a> describes all target description macros and functions.</li>
</ul>
</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com0tag:blogger.com,1999:blog-2420869991974935734.post-54353283061655157972017-08-22T17:01:00.000+02:002017-08-29T14:02:49.553+02:00GCC low-level IR and basic code generation<div style="text-align: justify;">
<i>This is part three of a series “<a href="https://kristerw.blogspot.se/2017/08/writing-gcc-backend_4.html">Writing a GCC back end</a>”.</i></div>
<div style="text-align: justify;">
<br />
Compilers are usually described as having three parts – a front end that handles the source language, a middle that optimizes the code using a target-independent representation, and a back end doing the target-specific code generation. GCC is no different in this regard – the front end generates a target-independent representation (GIMPLE) that is used when optimizing the code, and the result is passed to the back end that converts it to its own representation (RTL) and generates the code for the program.</div>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
The back end IR</h3>
<div style="text-align: justify;">
The back end’s internal representation for the code consists of a linked list of objects called <i>insns</i>. An insn corresponds roughly to an assembly instruction, but there are also insns representing labels, dispatch tables for <code>switch</code> statements, etc. Each insn is constructed as a tree of <i>expressions</i> and is usually written in a Lisp-like syntax. For example,<br />
<pre class="code-snippet">(reg:<i>m</i> <i>n</i>)
</pre>
is an expression representing a register access, and<br />
<pre class="code-snippet">(plus:<i>m</i> <i>x</i> <i>y</i>)
</pre>
represents adding the expressions <code><i>x</i></code> and <code><i>y</i></code>. An insn adding two registers may combine these as<br />
<pre class="code-snippet">(set (reg:<i>m</i> <i>r0</i>) (plus:<i>m</i> (reg:<i>m</i> <i>r1</i>) (reg:<i>m</i> <i>r2</i>)))
</pre>
The <code><i>m</i></code> in the expressions denotes a <i>machine mode</i> that defines the size and representation of the data object or operation in the expression. There are <a href="https://gcc.gnu.org/onlinedocs/gccint/Machine-Modes.html#Machine-Modes">lots of machine modes</a>, but the most common are<br />
<ul>
<li><code>QI</code> – “Quarter-Integer” mode represents a single byte treated as an integer.</li>
<li><code>HI</code> – “Half-Integer” mode represents a two-byte integer.</li>
<li><code>SI</code> – “Single Integer” mode represents a four-byte integer.</li>
<li><code>DI</code> – “Double Integer” mode represents an eight-byte integer.</li>
<li><code>CC</code> – “Condition Code” mode represents the value of a condition code (used to represent the result of a comparison operation).</li>
</ul>
GCC supports architectures where a byte is not 8 bits, but this blog series will assume 8-bit bytes (mostly as I often find it clearer to talk about a 32-bit value in examples, instead of a more abstract <code>SImode</code> value).<br />
<br />
<h3>
Overview of the back end operation</h3>
The back end runs a number of passes over the IR, and GCC will output the resulting RTL for each pass when <code>-fdump-rtl-all</code> is passed to the compiler.<br />
<br />
The back end starts by converting GIMPLE to RTL, and a small GIMPLE function such as<br />
<pre class="code-snippet">foo (int a)
{
int _2;
<bb 2> [100.00%]:
_2 = a_1(D) * 42;
return _2;
}
</pre>
is expanded to<br />
<pre class="code-snippet">(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 4 3 2 (set (reg/v:SI 73 [ a ])
(reg:SI 10 a0 [ a ])) "foo.c":2 -1
(nil))
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 7 2 (set (reg:SI 75)
(const_int 42 [0x2a])) "foo.c":3 -1
(nil))
(insn 7 6 8 2 (set (reg:SI 74)
(mult:SI (reg/v:SI 73 [ a ])
(reg:SI 75))) "foo.c":3 -1
(nil))
(insn 8 7 12 2 (set (reg:SI 72 [ <retval> ])
(reg:SI 74)) "foo.c":3 -1
(nil))
(insn 12 8 13 2 (set (reg/i:SI 10 a0)
(reg:SI 72 [ <retval> ])) "foo.c":4 -1
(nil))
(insn 13 12 0 2 (use (reg/i:SI 10 a0)) "foo.c":4 -1
(nil))
</pre>
</div>
<div style="text-align: justify;">
The generated RTL corresponds mostly to real instructions even at this early stage in the back end, but the generated code is inefficient and the registers are still virtual.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The next step is to run optimization passes on the RTL. This is the same kind of optimization passes that have already been run on GIMPLE (constant folding, dead code elimination, simple loop optimizations, etc.), but they can do a better job with knowledge of the target architecture. For example, loop optimizations may transform loops to better take advantage of loop instructions and addressing modes, and dead code elimination may see that the operations working on the upper part of<br />
<pre class="code-snippet">foo (long long int a, long long int b)
{
long long int _1;
long long int _4;
<bb 2> [100.00%]:
_1 = a_2(D) + b_3(D);
_4 = _1 & 255;
return _4;
}
</pre>
</div>
on a 32-bit architecture are not needed (as the upper part of the returned value is always <code>0</code>) and can thus be eliminated.<br />
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After this, instructions are combined and split into better instruction sequences by peephole optimizations, registers are allocated, and the instructions are scheduled. The resulting RTL dump contains all the information about which instructions were selected and which registers were allocated:</div>
<pre class="code-snippet">(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 17 [bb 2] NOTE_INSN_BASIC_BLOCK)
(note 17 4 2 NOTE_INSN_PROLOGUE_END)
(note 2 17 3 NOTE_INSN_DELETED)
(note 3 2 7 NOTE_INSN_FUNCTION_BEG)
(note 7 3 15 NOTE_INSN_DELETED)
(insn 15 7 12 (set (reg:SI 15 a5 [75])
(const_int 42 [0x2a])) "foo.c":4 132 {*movsi_internal}
(expr_list:REG_EQUIV (const_int 42 [0x2a])
(nil)))
(insn 12 15 13 (set (reg/i:SI 10 a0)
(mult:SI (reg:SI 10 a0 [ a ])
(reg:SI 15 a5 [75]))) "foo.c":4 15 {mulsi3}
(expr_list:REG_DEAD (reg:SI 15 a5 [75])
(nil)))
(insn 13 12 21 (use (reg/i:SI 10 a0)) "foo.c":4 -1
(nil))
(note 21 13 19 NOTE_INSN_EPILOGUE_BEG)
(jump_insn 19 21 20 (simple_return) "foo.c":4 211 {simple_return}
(nil)
-> simple_return)
(barrier 20 19 16)
(note 16 20 0 NOTE_INSN_DELETED)
</pre>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
Instruction patterns</h3>
<div style="text-align: justify;">
The target architecture’s instructions are described in <code><i>machine</i>.md</code> using <i>instruction patterns</i>. A simple instruction pattern looks like<br />
<pre class="code-snippet">(define_insn "mulsi3"
[(set (match_operand:SI 0 "register_operand" "=r")
(mult:SI (match_operand:SI 1 "register_operand" "r")
(match_operand:SI 2 "register_operand" "r")))]
""
"mul\t%0,%1,%2")
</pre>
which defines an insn with a name <code>mulsi3</code> that generates a <code>mul</code> instruction for a 32-bit multiplication.<br />
<br />
The first operand in the instruction pattern is a <i>name</i> that is used in debug dumps and when writing C++ code that generates RTL. For example,<br />
<pre class="code-snippet">emit_insn (gen_mulsi3 (dst, src1, src2));
</pre>
generates a <code>mulsi3</code> insn. The back end does not in general need to generate RTL, but the rest of GCC does, and the name tells GCC that it can use the pattern to accomplish a certain task. It is therefore important that the named patterns implement the <a href="https://gcc.gnu.org/onlinedocs/gccint/Standard-Names.html#Standard-Names">functionality that GCC expects</a>, but names starting with <code>*</code> are ignored and thus safe to use for non-standard instruction patterns. The back end should only implement the named patterns that make sense for the architecture – GCC will do its best to emit code for missing patterns using other strategies. For example, a 32-bit multiplication will be generated as a call to <code>__mulsi3</code> in libgcc if the target does not have a <code>mulsi3</code> instruction pattern.<br />
<br />
The next part of the instruction pattern is the <i>RTL template</i> that describes the semantics of the instruction that is generated by the insn<br />
<pre class="code-snippet">[(set (match_operand:SI 0 "register_operand" "=r")
(mult:SI (match_operand:SI 1 "register_operand" "r")
(match_operand:SI 2 "register_operand" "r")))]
</pre>
This says that the instruction multiplies two registers and places the result in a register. This example is taken from RISC-V, whose multiplication instruction has no side effects, but some architectures (such as x86-64) set condition flags as part of the operation, and that needs to be expressed in the RTL template, such as<br />
<pre class="code-snippet">[(parallel [(set (match_operand:SI 0 "register_operand" "=r")
(mult:SI (match_operand:SI 1 "register_operand" "r")
(match_operand:SI 2 "register_operand" "r")))
(clobber (reg:CC FLAGS_REG))])]
</pre>
where the <code>parallel</code> expresses that the two operations are done as a unit.<br />
<br />
The instruction’s operands are specified with expressions of the form<br />
<pre class="code-snippet">(match_operand:SI 1 "register_operand" "r")</pre>
consisting of four parts:<br />
<ul>
<li><code>match_operand:</code> followed by the machine mode of the operand.</li>
<li>The operand number used as an identifier when referring the operand.</li>
<li>A <i>predicate</i> telling what kind of operands are valid (<code>"register_operand"</code> means that it must be a register).</li>
<li>A <i>constraint</i> string describing the details of the operand (<code>"r"</code> means it must be a general register). These are the same constraints as are used in inline assembly (but the instruction patterns support additional constraints that are not allowed in inline assembly).</li>
</ul>
The predicate and constraint string contain similar information, but they are used in different ways:<br />
<ul>
<li>The predicate is used when generating the RTL. As an example, when generating RTL for a GIMPLE function <pre class="code-snippet">foo (int a)
{
int _2;
<bb 2> [100.00%]:
_2 = a_1(D) * 42;
return _2;
}
</pre>
then the GIMPLE to RTL converter will generate the multiplication as a <code>mulsi3</code> insn. The predicates will be checked for the result <code>_2</code> and the operands <code>a_1(D)</code> and <code>42</code>, and <code>42</code> is found to be invalid as only registers are allowed. GCC will therefore insert a <code>movsi</code> insn that moves the constant into a register.</li>
<li>The constraints are used when doing register allocation and final instruction selection. Many architectures have constraints on the operands; for example, m68k has 16 registers (<code>a0</code>–<code>a7</code> and <code>d0</code>–<code>d7</code>), but only <code>d0</code>–<code>d7</code> are allowed in a <code>muls</code> instruction. This is expressed by a constraint telling the register allocator that it must use one of the registers <code>d0</code>–<code>d7</code>.</li>
</ul>
<br />
The string after the RTL template contains C++ code that disables the instruction pattern if it evaluates to false (an empty string is treated as true). This is used when there are different versions of the architecture, such as one for small CPUs that lack multiplication instructions and one for more advanced cores that can multiply. This is typically handled by letting <code><i>machine</i>.opt</code> generate a global variable <code>TARGET_MUL</code> that can be set by options such as <code>-mmul</code> or <code>-march</code>, and placing this global variable in the condition string<br />
<pre class="code-snippet">(define_insn "mulsi3"
[(set (match_operand:SI 0 "register_operand" "=r")
(mult:SI (match_operand:SI 1 "register_operand" "r")
(match_operand:SI 2 "register_operand" "r")))]
"TARGET_MUL"
"mul\t%0,%1,%2")
</pre>
so that the instruction pattern is disabled (and the compiler will thus generate a call to libgcc) when multiplication is not available.<br />
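A sketch of what such a <code><i>machine</i>.opt</code> entry could look like (a hypothetical fragment, not copied from a real port – the <code>Mask(MUL)</code> property makes the option machinery generate a <code>MASK_MUL</code> bit in <code>target_flags</code> together with the <code>TARGET_MUL</code> macro tested in the pattern):

```
mmul
Target Mask(MUL)
Generate instructions for hardware multiplication.
```

A port that wants an ordinary variable instead of a <code>target_flags</code> bit can use <code>Var(TARGET_MUL)</code> in place of <code>Mask(MUL)</code>.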
<br />
The resulting instruction is emitted using the <i>output template</i>
<br />
<pre class="code-snippet">"mul\t%0,%1,%2"
</pre>
where <code>%0</code>, <code>%1</code>, ..., are substituted with the corresponding operands.<br />
<br />
<h3>
More about <code>define_insn</code></h3>
Many architectures need more complex instruction handling than the RISC-V <code>mul</code> instruction described above, but <code>define_insn</code> is flexible enough to handle essentially all cases that occur for real CPUs.<br />
<br />
Let’s say our target can multiply a register and an immediate integer and that this requires the first operand to be an even-numbered register, while multiplying two registers requires that the first operand is an odd-numbered register (this is not as strange as it may seem – some CPU designs use such tricks to save one bit in the instruction encoding). This is easily handled by defining a new predicate in <code>predicates.md</code><br />
<pre class="code-snippet">(define_predicate "reg_or_int_operand"
(ior (match_code "const_int")
(match_operand 0 "register_operand")))
</pre>
accepting a register or an integer, and new constraints in <code>constraints.md</code> that require an odd or an even register<br />
<pre class="code-snippet">(define_register_constraint "W" "ODD_REGS"
"An odd-numbered register.")
(define_register_constraint "D" "EVEN_REGS"
"An even-numbered register.")
</pre>
where <code>ODD_REGS</code> and <code>EVEN_REGS</code> are register classes (see <a href="https://kristerw.blogspot.se/2017/08/gcc-target-description-macros-and.html">part four</a> in this series). We can now use this in the instruction pattern<br />
<pre class="code-snippet">(define_insn "mulsi3"
[(set (match_operand:SI 0 "register_operand" "=r,r")
(mult:SI (match_operand:SI 1 "register_operand" "%W,D")
(match_operand:SI 2 "reg_or_int_operand" "r,i")))]
""
"mul\t%0,%1,%2")
</pre>
The constraint strings now list two alternatives – one for the register/register case and one for the register/integer case. A <code>%</code> character has also been added to tell the back end that the operation is commutative, so that code generation may switch the order of the operands when the integer is the first operand, and so that register allocation is helped for cases such as<br />
<pre class="code-snippet">_1 = _2 * 42;
_3 = _2 * _4;
</pre>
where changing the order of the arguments on the second line avoids inserting an extra move to make <code>_2</code> an even register on the first line and an odd register on the second line.<br />
<br />
Sometimes the different alternatives need to generate different instructions, such as the instruction multiplying two registers being called <code>mulr</code> and the multiplication with an integer being called <code>muli</code>. This can be handled by starting the output template string with an <code>@</code> character and listing the different alternatives in the same order as in the constraint strings<br />
<pre class="code-snippet">(define_insn "mulsi3"
[(set (match_operand:SI 0 "register_operand" "=r,r")
(mult:SI (match_operand:SI 1 "register_operand" "%W,D")
(match_operand:SI 2 "reg_or_int_operand" "r,i")))]
""
"@
mulr\t%0,%1,%2
muli\t%0,%1,%2")
</pre>
Finally, it is possible to write general C++ code that is run when outputting the instruction, so the previous could have been written as<br />
<pre class="code-snippet">(define_insn "mulsi3"
[(set (match_operand:SI 0 "register_operand" "=r,r")
(mult:SI (match_operand:SI 1 "register_operand" "%W,D")
(match_operand:SI 2 "reg_or_int_operand" "r,i")))]
""
{
return which_alternative == 0 ? "mulr\t%0,%1,%2" : "muli\t%0,%1,%2";
})
</pre>
This is usually not that useful for <code>define_insn</code> but may reduce the number of instruction patterns when the instruction names depend on the configuration. For example, <code>mulsi3</code> in the RISC-V back end must generate <code>mulw</code> in 64-bit mode and <code>mul</code> in 32-bit mode, which is implemented as<br />
<pre class="code-snippet">(define_insn "mulsi3"
[(set (match_operand:SI 0 "register_operand" "=r")
(mult:SI (match_operand:SI 1 "register_operand" "r")
(match_operand:SI 2 "register_operand" "r")))]
""
{ return TARGET_64BIT ? "mulw\t%0,%1,%2" : "mul\t%0,%1,%2"; })
</pre>
where <code>TARGET_64BIT</code> is a global variable defined in <code>riscv.opt</code>.<br />
<br />
<h3>
Further reading</h3>
This blog post has only scratched the surface of the RTL and machine description functionality, but everything is documented in “<a href="https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gccint.pdf">GNU Compiler Collection Internals</a>”:<br />
<ul>
<li><a href="https://gcc.gnu.org/onlinedocs/gccint/RTL-passes.html#RTL-passes">Section 9.6</a> describes the RTL generation and optimization passes that are run in the back end.</li>
<li><a href="https://gcc.gnu.org/onlinedocs/gccint/RTL.html#RTL">Chapter 13</a> describes the RTL, including all RTL expressions.</li>
<li><a href="https://gcc.gnu.org/onlinedocs/gccint/Machine-Desc.html#Machine-Desc">Chapter 16</a> describes machine description, including all <a href="https://gcc.gnu.org/onlinedocs/gccint/Standard-Names.html#Standard-Names">predefined pattern names</a>, <a href="https://gcc.gnu.org/onlinedocs/gccint/Predicates.html#Predicates">predicates</a>, and <a href="https://gcc.gnu.org/onlinedocs/gccint/Constraints.html#Constraints">constraints</a>.</li>
</ul>
</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com0tag:blogger.com,1999:blog-2420869991974935734.post-58371740909293479922017-08-13T18:44:00.000+02:002017-08-13T18:44:34.092+02:00Getting started with a GCC back end<div style="text-align: justify;">
<i>This is part two of a series “<a href="https://kristerw.blogspot.se/2017/08/writing-gcc-backend_4.html">Writing a GCC back end</a>”.</i><br />
<div>
<br />
Most CPU architectures have a common subset – they have instructions doing arithmetic and bit operations on a few general registers, an instruction that can write a register to memory, and an instruction that can read from memory and place the result in a register. It is therefore easy to make a compiler that can compile simple straight-line functions by taking an existing back end and restricting it to this common subset. This is enough to start running the test suite, and it is then straightforward to address one deficiency at a time (adding more instructions, addressing modes, the ABI, etc.).<br />
<div style="text-align: justify;">
<br />
My original thought was that the <a href="https://riscv.org/">RISC-V</a> back end would be a good choice as a starting point – the architecture is <a href="https://riscv.org/specifications/">fully documented</a>, and it is a new, actively maintained back end that does not use legacy APIs. But the RISC-V back end has lots of functionality (such as support for multiple ISA profiles, 32- and 64-bit modes, position-independent code, exception handling, and debug information), and the work of reducing it became unnecessarily complicated when I tried...<br />
<br />
I now think it is better to start from one of the minimal back ends, such as the back end for the <a href="http://moxielogic.org/blog/pages/architecture.html">Moxie architecture</a>. Moxie seems to be a good choice as there is also a blog series “<a href="http://atgreen.github.io/ggx/">How To Retarget the GNU Toolchain in 21 Patches</a>” describing step-by-step how it was developed. The blog series is old, but GCC has a very stable API, so it is essentially the same now (I once updated a GCC back end from GCC 4.3 to GCC 4.9, which were released 6 years apart, and only a few lines needed to be modified...).<br />
<br />
One thing missing from the Moxie blog series is how to build the compiler and how to configure and run the test-suite, but I blogged about that a while back in “<a href="https://kristerw.blogspot.se/2015/05/running-gcc-testsuite-for-epiphany-sim.html">Running the GCC test-suite for epiphany-sim</a>”.</div>
</div>
</div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com0tag:blogger.com,1999:blog-2420869991974935734.post-85483267905446454522017-08-06T09:06:00.000+02:002017-08-06T09:06:38.843+02:00The structure of a GCC back end<div style="text-align: justify;">
<i>This is part one of a series “<a href="https://kristerw.blogspot.se/2017/08/writing-gcc-backend_4.html">Writing a GCC back end</a>”.</i></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The GCC back end is configured in <code>gcc/config.gcc</code> and the implementation is placed in directories <code><i>machine</i></code> under <code>gcc/config</code> and <code>gcc/common/config</code>, where “machine” is the name of the back end (for example, <code>i386</code> for the x86 architecture).<br />
<br />
The back end places some functionality in <code>libgcc</code>. For example, architectures that do not have an instruction for integer division will instead generate a call to a function <code>__divsi3</code> in <code>libgcc</code>. <code>libgcc</code> is configured in <code>libgcc/config.host</code> and target-specific files are located in a directory <code><i>machine</i></code> under <code>libgcc/config</code>.<br />
<br /></div>
<h3 style="text-align: justify;">
gcc/config.gcc</h3>
<div style="text-align: justify;">
<code>config.gcc</code> is a shell script that parses the target string (e.g. <code>x86_64-linux-gnu</code>) and sets variables pointing out where to find the rest of the back end and how to compile it. The variables that can be set are documented at the top of the <code>config.gcc</code> file.<br />
<br />
The only variable that must be set is <code>cpu_type</code> that specifies <code><i>machine</i></code>. Most targets also set <code>extra_objs</code> that specifies extra object files that should be linked into the compiler, <code>tmake_file</code> that contains makefile fragments that compiles those extra objects (or sets makefile variables modifying the build), and <code>tm_file</code> that adds header files containing target-specific information.<br />
<br />
A typical configuration for a simple target (such as <code>ft32-unknown-elf</code>) looks something like<br />
<pre class="code-snippet">cpu_type=ft32
tm_file="dbxelf.h elfos.h newlib-stdint.h ${tm_file}"
</pre>
</div>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
gcc/config/<i>machine</i></h3>
<div style="text-align: justify;">
The main part of the back end is located in <code>gcc/config/<i>machine</i></code>. It consists of eight different components, each implemented in a separate file:</div>
<ul>
<li><div style="text-align: justify;">
<code><i>machine</i>.h</code> is included all over the compiler and contains macros defining properties of the target, such as the size of integers and pointers, number of registers, data alignment rules, etc.</div>
</li>
<li><div style="text-align: justify;">
GCC implements a generic back end where <code><i>machine</i>.c</code> can override most of the functionality. The back end is written in C,<sup>1</sup> so the virtual functions are handled manually with function pointers in a structure, and <code><i>machine</i>.c</code> overrides the defaults using code of the form</div>
<pre class="code-snippet" style="text-align: justify;"><span style="color: #4c8317;">#undef TARGET_FRAME_POINTER_REQUIRED</span>
<span style="color: #4c8317;">#define TARGET_FRAME_POINTER_REQUIRED ft32_frame_pointer_required</span>
<span style="color: #0000aa;">static</span> <span style="color: #00aaaa;">bool</span>
<span style="color: #00aa00;">ft32_frame_pointer_required</span> (<span style="color: #00aaaa;">void</span>)
{
<span style="color: #0000aa;">return</span> cfun->calls_alloca;
}
</pre>
</li>
<li><code style="text-align: justify;"><i>machine</i>-protos.h</code><span style="text-align: justify;"> contains prototypes for the external functions defined in</span> <code style="text-align: justify;"><i>machine</i>.c</code>.</li>
<li><div style="text-align: justify;">
<code><i>machine</i>.opt</code> adds target-specific command-line options to the compiler using a record format specifying the option name, properties, and a documentation string for the <code>--help</code> output. For example,<br />
<pre class="code-snippet" style="text-align: justify;">msmall-data-limit=
Target Joined Separate UInteger Var(g_switch_value) Init(8)
-msmall-data-limit=N Put global and static data smaller than <number> bytes into a special section.
</pre>
adds a command-line option <code>-msmall-data-limit</code> that has a default value of 8 and is generated as an unsigned variable named <code>g_switch_value</code>.</div>
</li>
<li style="text-align: justify;"><code><i>machine</i>.md</code>, <code>predicates.md</code>, and <code>constraints.md</code> contain the machine description consisting of rules for instruction selection and register allocation, pipeline description, and peephole optimizations. These will be covered in detail in parts 3–7 of this series.</li>
<li style="text-align: justify;"><code><i>machine</i>-modes.def</code> defines extra machine modes for use in the low-level IR (in GCC terminology, a “machine mode” defines the size and representation of a data object – that is, it is a data type). This is typically used for condition codes and vectors.</li>
</ul>
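As an example of what <code><i>machine</i>-modes.def</code> entries look like, a back end can add a condition-code mode and integer vector modes with lines like these (patterned after existing ports such as i386; whether they are needed depends on the architecture):

```
/* A condition-code mode for comparisons that set only the zero flag.  */
CC_MODE (CCZ);

/* 8-byte integer vectors: V8QI, V4HI, and V2SI modes.  */
VECTOR_MODES (INT, 8);
```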
The GCC configuration is very flexible and everything can be overridden, so some back ends look slightly different as they, for example, add several <code>.opt</code> files by setting <code>extra_options</code> in <code>config.gcc</code>.<br />
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
gcc/common/config/<i>machine</i></h3>
<div>
<div style="text-align: justify;">
The <code>gcc/common/config/<i>machine</i></code> directory contains a file <code><i>machine</i>-common.c</code> that can add/remove optimization passes, change the defaults for <code>--param</code> values, etc.<br />
<br />
Many back ends do not need to do anything here, and this file can be disabled by setting<br />
<pre class="code-snippet">target_has_targetm_common=no
</pre>
in <code>config.gcc</code>.</div>
<div style="text-align: justify;">
<br /></div>
</div>
<h3 style="text-align: justify;">
libgcc/config.host</h3>
<div style="text-align: justify;">
The libgcc <code>config.host</code> works in the same way as <code>config.gcc</code>, but with different variables.<br />
<br />
The only variable that must be set is <code>cpu_type</code>, which specifies <code><i>machine</i></code>. Most targets also set <code>extra_parts</code>, which lists extra object files to include in the library, and <code>tmake_file</code>, which lists makefile fragments that add extra functionality (such as soft-float support).<br />
<br />
A typical configuration for a simple target looks something like<br />
<pre class="code-snippet">cpu_type=ft32
tmake_file="$tmake_file t-softfp"
extra_parts="$extra_parts crti.o crtn.o crtbegin.o crtend.o"
</pre>
<br />
<h3>
libgcc/config/<i>machine</i></h3>
The <code>libgcc/config/<i>machine</i></code> directory contains extra files that may be needed for the target architecture. Simple ports typically contain only a <code>crti.S</code> and <code>crtn.S</code> (<code>crtbegin</code>/<code>crtend</code> and the makefile support for building all of these have default implementations) and a file <code>sfp-machine.h</code> containing defaults for the soft-float implementation.<br />
<br />
<hr />
<span style="font-size: x-small;">1. GCC is written in C++03 these days, but the structure has not been changed since it was written in C.</span></div>
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com0tag:blogger.com,1999:blog-2420869991974935734.post-51373254477112948312017-08-04T07:24:00.000+02:002018-01-21T12:38:21.499+01:00Writing a GCC back end<div style="text-align: justify;">
It is surprisingly easy to design a CPU (see for example Colin Riley’s <a href="http://labs.domipheus.com/blog/designing-a-cpu-in-vhdl-part-1-rationale-tools-method/">blog series</a>), and I was recently asked how hard it is to write a GCC back end for your new architecture. That too is easy – provided you have done it once before. But the first time is quite painful...</div>
<div style="text-align: justify;">
<br />
Over the coming weeks I plan to write a series of blog posts that will try to ease the pain by showing what is involved in creating a “working” back end capable of compiling simple functions, giving some pointers on how to make this production-ready, and in general providing the overview I would have liked before I started developing my back end. (GCC has a good reference manual, “<a href="https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gccint.pdf">GNU Compiler Collection Internals</a>”, that describes everything you need to know, but it is a bit overwhelming when you are starting out...)<br />
<br /></div>
<div style="text-align: justify;">
The series covers the following (I’ll update the list with links to the posts as they become available): </div>
<div style="text-align: justify;">
</div>
<ol>
<li><a href="https://kristerw.blogspot.se/2017/08/the-structure-of-gcc-back-end.html">The structure of a GCC back end</a></li>
<ul>
<li>Which files you need to write/modify</li>
</ul>
<li><a href="https://kristerw.blogspot.se/2017/08/getting-started-with-gcc-back-end.html">Getting started with a GCC back end</a></li>
<ul>
<li>Pointers to resources describing how to set up the initial back end</li>
</ul>
<li><a href="https://kristerw.blogspot.se/2017/08/gcc-low-level-ir-and-basic-code.html">Low-level IR and basic code generation</a></li>
<ul>
<li>How the low-level IR works</li>
<li>How the IR is lowered to instructions</li>
<li>How to write simple instruction patterns</li>
</ul>
<li><a href="https://kristerw.blogspot.se/2017/08/gcc-target-description-macros-and.html">Target description macros and functions</a></li>
<ul>
<li>Working with the RTL</li>
<li>Describing the registers (names, register classes, allocation order, ...)</li>
<li>Addressing modes</li>
</ul>
<li><a href="https://kristerw.blogspot.se/2017/12/more-about-gcc-instruction-patterns.html">More about instruction patterns</a></li>
<ul>
<li><code>define_expand</code></li>
<li>The <code>unspec</code> and <code>unspec_volatile</code> expression codes</li>
<li>Attributes</li>
</ul>
<li><a href="https://kristerw.blogspot.se/2018/01/gcc-back-end-performance-tuning.html">Improving the performance</a></li>
<ul>
<li>Cost model</li>
<li>Peephole optimization</li>
<li>Tuning the optimization passes</li>
</ul>
<li>Instruction scheduling</li>
<ul>
<li>Describing the processor pipeline</li>
</ul>
</ol>
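As a small taste of the machine description covered later in the series, here is a sketch of an instruction pattern for a hypothetical RISC-style three-register add (the pieces – <code>match_operand</code>, predicates, and constraints – are explained in part 3):

```
;; Hypothetical pattern sketch: add two 32-bit registers.
(define_insn "addsi3"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (plus:SI (match_operand:SI 1 "register_operand" "r")
                 (match_operand:SI 2 "register_operand" "r")))]
  ""
  "add\t%0,%1,%2")
```

The RTL template tells GCC what the instruction does, and the output string tells it what assembly to emit when the pattern matches.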
Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com10tag:blogger.com,1999:blog-2420869991974935734.post-37723181428071195332017-07-24T19:49:00.000+02:002017-07-24T19:49:10.919+02:00Phoronix SciMark benchmarking results<div style="text-align: justify;">
Phoronix recently published an article “<a href="http://www.phoronix.com/scan.php?page=article&item=gcc8-clang5-znver1&num=1">Ryzen Compiler Performance: Clang 4/5 vs. GCC 6/7/8 Benchmarks</a>”, and there are many results in that article that surprise me...<br />
<br /></div>
<div style="text-align: justify;">
One such surprise is the <a href="http://www.phoronix.com/scan.php?page=article&item=gcc8-clang5-znver1&num=2">result for SciMark</a>, which shows that GCC generates much slower code than LLVM – there is a big difference in several tests, and the composite score is 25% lower. I do not have any Ryzen CPU to test on, but my testing on Broadwell shows very little difference between GCC and LLVM when SciMark is compiled with <code>-O3 -march=x86-64</code> as in the article, and the Ryzen microarchitecture should not make <i>that</i> big a difference. And the reported numbers seem low...</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The Phoronix testing also shows strange performance variations between different GCC versions that I don’t see in my testing – I see a performance increase for each newer version of the compiler.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The published test results are produced by running scripts available at <a href="http://openbenchmarking.org/">OpenBenchmarking.org</a>, and looking at the <a href="http://openbenchmarking.org/innhold/a017b89f87b7d38f270d2541bc92c26410d33855">build script for SciMark</a> shows that it is built as<br />
<pre class="code-snippet">cc $CFLAGS -o scimark2 -O *.c -lm
</pre>
Note the <code>-O</code> – this overrides the optimization level set by <code>$CFLAGS</code> and explains at least some of the discrepancies in the test results.<sup>1</sup> GCC maps <span style="white-space: nowrap;"><code>-O</code></span> to the <span style="white-space: nowrap;"><code>-O1</code></span> optimization level, which is meant to be a good choice while developing – it optimizes the code, but focuses as much on fast compilation time and good debug information as on producing fast code. LLVM maps <span style="white-space: nowrap;"><code>-O</code></span> to <span style="white-space: nowrap;"><code>-O2</code></span>, a “real” optimization level that prioritizes performance, so it is not surprising that LLVM produces faster code in this case.<br />
<br />
So the benchmarking results do not show what was intended, and both compilers can do better than the published numbers suggest...</div>
<br />
<hr />
<span style="font-size: x-small;">1. I get results similar to the article’s when I use <code>-O</code>, but my result for FFT is very different...</span>Krister Walfridssonhttp://www.blogger.com/profile/02591279630933941271noreply@blogger.com0