Volatile in C
The main use case for volatile in C is memory mapped hardware, where reads and writes may change the hardware state (even a read may change the state; it is common that e.g. a read of an interrupt register resets the value). This means that the compiler must not eliminate, introduce, or reorder memory accesses to such memory.
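As a concrete illustration, here is a minimal sketch of how such hardware is typically accessed; the device, register layout, and base address are made up for the example:

/* Hypothetical memory mapped UART; the address and bit layout are invented. */
#define UART_STATUS (*(volatile unsigned int *)0x40001000u)  /* bit 0: transmitter ready */
#define UART_DATA   (*(volatile unsigned int *)0x40001004u)

void uart_send(unsigned char c) {
    while ((UART_STATUS & 1u) == 0)
        ;              /* each iteration must re-read the status register */
    UART_DATA = c;     /* the store must not be removed or reordered */
}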
Real world hardware has restrictions that make it hard to specify access rules in a way that works on all hardware. A few examples:
- The original DEC Alpha processor could only do 32-bit load/store operations, so an access to an 8-bit variable must be done by reading 32 bits, modifying the value, and writing back the result. This means that writing one element in volatile char a[4]; will read and write all the elements.
- Unaligned load/store operations may be split into two operations on the bus.
- The hardware may have restrictions on how read-modify-write instructions can be used for memory mapped hardware, so e.g. incrementing a volatile variable as a++ may need to be done as a load, an addition, and a store, even if the CPU has an instruction that increments a value in memory (see the sketch below).
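A sketch of what the last point means in practice; the two functions and their names are just for illustration, the point is that the increment of a volatile object consists of three separate accesses in the abstract machine:

volatile int a;

void increment(void) {
    a++;              /* for a volatile object this behaves as ... */
}

void increment_expanded(void) {
    int tmp = a;      /* ... a load, */
    tmp = tmp + 1;    /* an addition, */
    a = tmp;          /* and a store */
}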
The standard solves these problems by saying that accesses to volatile objects need to be done as in the source code, but it is up to the implementation to define what constitutes an "access". The text of this is in C11 6.7.3 "Type qualifiers":
An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3. Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine, except as modified by the unknown factors mentioned previously. What constitutes an access to an object that has volatile-qualified type is implementation-defined.
This means that an implementation can do essentially anything for volatile as long as it can cover it by a definition of "access", but even a reasonable definition may produce surprises. For example, the text above talks about "at every sequence point", which means that the compiler is allowed to optimize
volatile int *p = somevalue;
int s = *p + *p;
as if it said
volatile int *p = somevalue;
int tmp = *p;
int s = tmp + tmp;
as both reads of *p occur between the same pair of sequence points.
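If two distinct reads really are intended (for example when reading a hardware FIFO twice), it seems safer to separate the accesses with a sequence point, as in this sketch:

volatile int *p = somevalue;
int first = *p;           /* first read */
int second = *p;          /* second read; a sequence point separates the two accesses */
int s = first + second;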
A similar case is when the value of an assignment expression is used, such as
volatile int a, b;
a = b = 0;
The standard does not say whether b is read, i.e. the compiler is allowed to compile this as
b = 0; a = 0;
or
b = 0; a = b;
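If the hardware depends on one specific behavior, it seems better to avoid the question entirely and write the accesses as separate statements:

volatile int a, b;
b = 0;
a = 0;    /* or "a = b;" if a read of b is what the hardware needs */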
The examples above are probably not that common, but even trivial cases may be surprising. Consider the load *p in
volatile int *p = somevalue;
*p;
C++ compilers must generate a load for this case (because of the C++ rules regarding lvalue-to-rvalue conversion for discarded-value expressions). Many developers argue that C compilers must generate a load too, but I'm not completely convinced. And, for example, the GCC 5.3.0 manual describes this as if the compiler has a choice:
According to the C standard, such an expression is an rvalue whose type is the unqualified version of its original type, i.e. int. Whether GCC interprets this as a read of the volatile object being pointed to or only as a request to evaluate the expression for its side-effects depends on this type.
If it is a scalar type, or on most targets an aggregate type whose only member object is of a scalar type, or a union type whose member objects are of scalar types, the expression is interpreted by GCC as a read of the volatile object; in the other cases, the expression is only evaluated for its side-effects.
Anyway, many C compilers do not generate the load, so you should not rely on this construct, regardless of what the right interpretation of the standard is.
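If a read is what is wanted, it is more robust to make it explicit, for example by storing the value in a local variable (the cast to void just silences unused-variable warnings):

volatile int *p = somevalue;
int value = *p;    /* an explicit read of the volatile object */
(void)value;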
Using volatile to limit optimization
The C standard specifies that volatile accesses must not be reordered or removed by the compiler, but CPUs have things like write buffers that reorder and remove writes. That functionality is usually disabled for the memory regions that contain memory mapped hardware, but it is enabled for the stack. This always confuses me when I see local variables being volatile (when not needed by longjmp): why is it OK for the CPU to reorder memory operations when the compiler must not do it? What did the developer expect to achieve by adding volatile?
The idea is usually to use volatile to disable compiler optimizations. A typical example is cryptographic code that wants to clear sensitive data from memory. The "obvious" way of clearing memory does not work
void getPassword(void) {
    char pwd[64];
    if (GetPassword(pwd, sizeof(pwd))) {
        /* Checking of password, secure operations, etc. */
    }
    memset(pwd, 0, sizeof(pwd));
}
as the compiler may optimize away the memset (pwd is not read after the call, so the stores are dead and may be eliminated), and the usual suggestion is to change the memset to something like
volatile char *p = pwd;
for (size_t i = 0; i < sizeof(pwd); ++i)
    p[i] = 0;
This may generate store instructions, but it is not guaranteed to clear the memory! Consider for example a hypothetical architecture that invalidates the cache for the stack frame when popping the stack, instead of wasting power and memory bandwidth by writing the now useless data to the memory — the cache will be invalidated before the newly written zeros have reached main memory.
One other problem is that the compiler may have placed sensitive data in "anonymous" temporary space allocated on the stack (e.g. by spilling), in registers, etc., as is discussed in "Zeroing buffers is insufficient":
I am absolutely certain that there is software out there which inadvertantly keeps an AES key sitting in an XMM register long after it has been wiped from memory.
It may still be prudent to try to clear the memory (and it does not hurt), but be aware that it is not guaranteed to help either...
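One widely used mitigation, sketched below, is to call memset through a volatile function pointer so the compiler cannot prove that the call is dead; note that this is a common trick rather than something the standard guarantees, and the helper name is made up:

#include <string.h>

/* The volatile qualifier forces the pointer to be re-read at the call site,
   so the compiler cannot see that the call is just a removable dead memset. */
static void *(*volatile memset_ptr)(void *, int, size_t) = memset;

static void secure_clear(void *p, size_t n) {
    memset_ptr(p, 0, n);
}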
Volatile and real world compilers
Compilers tend to have a lot of code handling special cases. There are several reasons for this, such as
- Different operations can be optimized in different ways
- General optimizations are often expensive, so it makes sense to handle cheap cases separately in order to reduce the load on the general optimization pass, or to be able to do the cheaper cases more often
- The compiler may need different optimization strategies depending on surrounding code (e.g. depending on register pressure)
- Different target CPUs need different optimization strategies
It is easy to forget to check for "volatile" in some obscure corner case, and it is hard to test in a way that finds all such problems, so it is not that uncommon to find cases where volatile accesses are reordered or removed.
One example of this is discussed in OpenSSL pull request #102:
#include <stddef.h>

static void memset_s(void * v, size_t n) {
    volatile unsigned char * p = (volatile unsigned char *)v;
    for (size_t i = 0; i < n; ++i) {
        p[i] = 0;
    }
}

void f1() {
    unsigned char x[4];
    memset_s(x, sizeof x);
}

void f2() {
    unsigned char x[7];
    memset_s(x, sizeof x);
}
Current GCC optimizes away the memset_s in f1, but keeps it in f2.
Both functions are optimized the same way by the high-level optimizations; the memset_s is inlined and the loop unrolled:
f1 ()
{
  unsigned char x[4];

  <bb 2>:
  MEM[(volatile unsigned char *)&x] ={v} 0;
  MEM[(volatile unsigned char *)&x + 1B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 2B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 3B] ={v} 0;
  x ={v} {CLOBBER};
  return;
}

f2 ()
{
  unsigned char x[7];

  <bb 2>:
  MEM[(volatile unsigned char *)&x] ={v} 0;
  MEM[(volatile unsigned char *)&x + 1B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 2B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 3B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 4B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 5B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 6B] ={v} 0;
  x ={v} {CLOBBER};
  return;
}
The difference is introduced when the GIMPLE IR above is lowered to the representation used for instruction selection: the compiler sees that x[4] is an int-sized non-volatile array on the stack, where all accesses are using known indices, and this is a special case where it is profitable to place the array in a register (that is then eliminated as its value is never used). The problem is that this is not a valid transformation unless all accesses are non-volatile, and the code forgot to check for this...