## Sunday, April 10, 2016

### Why volatile is hard to specify and implement

#### Volatile in C

The main use case for volatile in C is memory mapped hardware where reads and writes may change the hardware state (even reads may change the state — it is common that e.g. a read of an interrupt register resets the value). This means that the compiler must not eliminate, introduce, or reorder memory accesses to such memory.

Real world hardware have restrictions that makes it is hard to specify access rules in a way that works on all hardware. A few examples:
• The original DEC Alpha processor could only do 32-bit load/store operations, so access to an 8-bit variable must be done by reading 32 bits, modifying the value, and writing back the result. This means that writing one element in
volatile char a[4];

will read and write all the elements.
• Unaligned load/store operations are split into two operations on the bus.
• The hardware may have restrictions on how read-modify-write instructions are used for memory mapped hardware, so e.g. incrementing a volatile variable as a++ may need to be done as a load, an addition, and a store, even if the CPU has an instruction that increments a value in memory.
The standard solves these problems by saying that accesses to volatile objects need to be done as in the source code, but it is up to the implementation to define what constitutes an "access". The text of this is in C11 6.7.3 "Type qualifiers":
An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3. Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine, except as modified by the unknown factors mentioned previously. What constitutes an access to an object that has volatile-qualified type is implementation-defined.
This means that an implementation can do essentially anything for volatile as long as it can cover this by a definition of "access", but even a reasonable definition may produce surprises. For example, the text above talks about "at every sequence point". This means that the compiler is allowed to optimize
volatile int *p = somevalue;
int s = *p + *p;

as if it said
volatile int *p = somevalue;
int tmp = *p;
int s = tmp + tmp;

as both reads of *p are between sequence points. A similar case is when the value of an assignment expression is used, such as
volatile int a, b;
a = b = 0;

The standard does not say if b is read, i.e. the compiler is allowed to compile this as
b = 0; a = 0;
or
b = 0; a = b;

The examples above are probably not that common, but even trivial cases may be surprising. Consider the load *p in
volatile int *p = somevalue;
*p;

C++ compilers must not generate a load for this case (because of the C++ rules regarding lvalue-to-rvalue conversion). Many developers argue that C compilers must generate a load, but I'm not completely convinced. And, for example, the GCC 5.3.0 manual describes this as if the compiler has a choice:
According to the C standard, such an expression is an rvalue whose type is the unqualified version of its original type, i.e. int. Whether GCC interprets this as a read of the volatile object being pointed to or only as a request to evaluate the expression for its side-effects depends on this type.
If it is a scalar type, or on most targets an aggregate type whose only member object is of a scalar type, or a union type whose member objects are of scalar types, the expression is interpreted by GCC as a read of the volatile object; in the other cases, the expression is only evaluated for its side-effects.
Anyway, many C compilers do not generate the load, so you should not rely on this construct, regardless of what is the right interpretation of the standard.

#### Using volatile to limit optimization

The C standard specifies that volatile accesses will not be reordered or removed, but CPUs have things like write buffers that reorder and removes writes. That functionality is usually disabled for the memory regions that contain memory mapped hardware, but it is enabled for the stack. This always confuses me when I see local variables being volatile (when not needed by longjmp) — why is it OK for the CPU to reorder memory operations when the compiler must not do it? What did the developer expect to achieve by adding volatile?

The idea is usually to use volatile to disable compiler optimizations. A typical example is cryptographic code that want to clear sensitive data from memory. The "obvious" way of clearing memory does not work
void getPassword(void) {
char pwd[64];
/* Checking of password, secure operations, etc. */
}
memset(pwd, 0, sizeof(pwd));
}

as the compiler may optimize away the memset, and the usual suggestion is to change the memset to something like
volatile char *p = pwd;
for (size_t i = 0; i < sizeof(pwd); ++i)
p[i] = 0;

This may generate store instructions, but it is not guaranteed to clear the memory! Consider for example a hypothetical architecture that invalidates the cache for the stack frame when popping the stack, instead of wasting power and memory bandwidth by writing the now useless data to the memory — the cache will be invalidated before the newly written zeros have reached main memory.

One other problem is that the compiler may have placed sensitive data in "anonymous" temporary space allocated on the stack (e.g. by spilling), in registers etc., as is discussed in "Zeroing buffers is insufficient":
I am absolutely certain that there is software out there which inadvertantly keeps an AES key sitting in an XMM register long after it has been wiped from memory.
It may still be prudent to try to clear the memory (and it does not hurt), but be aware that it is not guaranteed to help either...

#### Volatile and real world compilers

Compilers tend to have much code handling special cases. There are several reasons, such as
• Different operations can be optimized in different ways
• General optimizations are often expensive, so it makes sense to handle cheap cases separately in order to reduce the load on the general optimization pass, or to be able to do the cheaper cases more often
• The compiler may need different optimization strategies depending on surrounding code (e.g. depending on register pressure)
• Different target CPUs need different optimization strategies
It is easy to forget to check for "volatile" in some obscure corner case, and it is hard test in a way that finds all such problems, so it is not that uncommon to find cases where volatile accesses are reordered/removed.

One example of this is discussed in OpenSSL pull request #102:
#include <stddef.h>

static void memset_s(void * v, size_t n) {
volatile unsigned char * p = (volatile unsigned char *)v;
for (size_t i = 0; i < n; ++i) {
p[i] = 0;
}
}

void f1() {
unsigned char x[4];
memset_s(x, sizeof x);
}

void f2() {
unsigned char x[7];
memset_s(x, sizeof x);
}

Current GCC optimizes away the memset_s in f1, but keeps it in f2.

Both functions are optimized the same by the high-level optimizations — the memset_s is inlined, and the loop unrolled:
f1 ()
{
unsigned char x[4];

<bb 2>:
MEM[(volatile unsigned char *)&x] ={v} 0;
MEM[(volatile unsigned char *)&x + 1B] ={v} 0;
MEM[(volatile unsigned char *)&x + 2B] ={v} 0;
MEM[(volatile unsigned char *)&x + 3B] ={v} 0;
x ={v} {CLOBBER};
return;
}

f2 ()
{
unsigned char x[7];

<bb 2>:
MEM[(volatile unsigned char *)&x] ={v} 0;
MEM[(volatile unsigned char *)&x + 1B] ={v} 0;
MEM[(volatile unsigned char *)&x + 2B] ={v} 0;
MEM[(volatile unsigned char *)&x + 3B] ={v} 0;
MEM[(volatile unsigned char *)&x + 4B] ={v} 0;
MEM[(volatile unsigned char *)&x + 5B] ={v} 0;
MEM[(volatile unsigned char *)&x + 6B] ={v} 0;
x ={v} {CLOBBER};
return;
}

The difference is introduced when the GIMPLE IR above is lowered to the representation used for instruction selection: the compiler sees that x[4] is an int-sized non-volatile array on the stack, where all accesses are using known indices, and this is a special case where it is profitable to place the array in a register (that is then eliminated as its value is never used). The problem is that this is not a valid transformation unless all accesses are non-volatile, and the code forgot to check for this...

1. The hardware that invalidates the cache for a stack frame on function exit is not "hypothetical"; the Mill does that (http://millcomputing.com/docs/). Also, the data may not be in DRAM anyway, because the Mill supports "backless memory" that lives only in cache unless evicted.

On any platform, if I want to ensure that a memory region has been scrubbed I use a "flushingMemset(...)" function, which I implement separately for each platform. I wish it were part of libc.

1. memset_s() is part of the standard: http://en.cppreference.com/w/c/string/byte/memset

2. For an implementation to be useful for embedded or systems programming, it is necessary that there be a means of establishing a hard barrier such that the all operations that precede the barrier in source-code execution sequence also precede the barrier in the processor's logical execution sequence, and all those that follow the barrier in source-code execution sequence follow the barrier in logical execution sequence. A systems-programming compiler need not know nor care about what will happen past that point--that's the responsibility of the programmer using the compiler--but it's impossible to write robust code for things like memory paging without the ability to force the compiler to flush any variables it's holding in registers in ways the hardware can't possibly know anything about.

The simplest way for a compiler to support such languages, and one which will work with most code, is to treat all volatile accesses as though preceded and followed by a barrier. This will impede some optimizations that might otherwise have been useful, but if code spends any time waiting for I/O the performance cost will be minor, and this approach is most likely to be compatible with code that doesn't use any implementation-specific directives to exercise more precise control. If a program is going to use DMA buffers or other such constructs, having a barrier between code that writes to a buffer and code that starts a DMA operation will eliminate the need to use volatile semantics for all operations upon the buffer. A directive which could impose a barrier only to accesses using involving the buffer might allow more optimizations in calling code than would a global barrier, but an option to regard "volatile" accesses as barriers would allow code to work on any compiler supporting that option, rather than only on those supporting a particular form of targeted barrier.

3. For C++ DR 1054 "Lvalue-to-rvalue conversions in expression statements" changes the behavior from C++03 in C++11 and says there is an lvalue-to-rvalue conersion: http://open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html#1054

Also see Stackoverflow question "Correct behaviour of trivial statements involving expressions with volatile variables?" where some of the history is laid out: https://stackoverflow.com/q/20242868/1708801