This article discusses atomicity, the Interlocked class, and memory operation optimizations made by compilers and CPUs, and how they affect multi-threaded programs.
1. Atomicity
I guess most of you have run across a lock block that consists of a single operation, like this:
lock (someLockObject)
{
    counter++; // counter is an int
}
If counter might be accessed concurrently from another thread, the lock is necessary. This is because even a simple operation like an increment is not atomic: it involves several steps (load the value into a register, increment it, store the value back to memory). The lock ensures no other thread accesses counter in the meantime.
There is an alternative way to make simple operations like increment atomic: the Interlocked class. This class provides methods to perform various operations atomically. These operations include incrementing, decrementing, adding a value, exchanging values, conditionally exchanging values, and so on. Hence, the snippet could be rewritten as:
Interlocked.Increment(ref counter);
Using Interlocked is simpler and, in most cases, performs better. However, since lock is often used to guard a block of several statements, you should replace lock with Interlocked only if every access to counter is done within a lock block that contains just a single replaceable operation. If that complicates your source code, stay with lock.
Check out Interlocked's documentation for details about what operations it supports.
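As an illustration of conditional exchange, here is a minimal sketch of a lock-free "add below a limit" helper built on Interlocked.CompareExchange (the type and method names here are made up for this example):
using System.Threading;

class BoundedCounter
{
    private int _value;

    // Adds 'amount' only if the total stays at or below 'limit'.
    public bool TryAdd(int amount, int limit)
    {
        while (true)
        {
            int current = _value;       // snapshot the current value
            int desired = current + amount;
            if (desired > limit)
                return false;           // adding would exceed the limit

            // Atomically: if (_value == current) _value = desired;
            // CompareExchange returns the value _value held before the call.
            if (Interlocked.CompareExchange(ref _value, desired, current) == current)
                return true;            // no other thread raced us

            // _value changed under our feet; loop and retry.
        }
    }
}
The retry loop is the standard pattern around CompareExchange: re-read, recompute, and attempt the swap again whenever another thread wins the race.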
2. Optimizations
According to section I.12.6.6 of the CLI Standard, a read or write of a properly aligned value no larger than the native word size is atomic. That means a read or write of a byte, bool, char, int, float, or reference-type value is always atomic. If so, are the two following snippets equivalent? (Here, ready is declared as an int used as a flag, because Interlocked.Exchange has no bool overload.)
// Snippet 1
ready = 1; // atomic operation

// Snippet 2
Interlocked.Exchange(ref ready, 1);
The answer is no.
It turns out that atomicity alone is often not enough. For performance reasons, the CLI Standard allows memory operation (read/write) optimizations, provided they do not change the behavior of a single-threaded program. However, such optimizations can break a multi-threaded program in which a second thread accesses the same memory locations concurrently.
Note: These optimizations might happen at the software level (the C# compiler, the CLR's just-in-time compiler, the CLR's native image generator) and/or at the hardware level (the CPU and the memory system). In practice, not all permitted optimizations are applied. Still, you should always stick to the CLI Standard rather than rely on implementation details; otherwise your code might not run correctly on future software and hardware.
Here is an example of such an optimization and how it breaks a multi-threaded program.
int data;
int ready; // an int flag (0 = not ready, 1 = ready)

void OptimizeMe()
{
    // read data for some purpose
    var tmp = data; // read 1

    var thread2 = new Thread(() =>
    {
        data = 1;  // write 1
        ready = 1; // write 2
    });
    thread2.Start();

    while (ready == 0) { } // read 2
    Console.Write(data);   // read 3
}
If no optimizations were applied, one could say:
- The while loop (read 2) never makes the program hang, since thread2 will execute write 2 sooner or later.
- The program never prints 0.
However, when taking possible optimizations into account, anything can happen.
- The JIT compiler might hoist the read of ready out of the while loop. In other words, it decides to read ready just once and reuse that value for subsequent iterations instead of reading it repeatedly (see the sketch after this list). This optimization causes the update from thread2 to be ignored, and the program hangs.
- On a multiprocessor machine, each thread might run on a different processor. For performance reasons, each processor has its own private cache. To keep caches consistent across processors, a cache has to talk not only to main memory but also to the caches of other processors. That takes time and causes the CPUs to wait. To reduce that wait time and make better use of bus traffic, modern multiprocessor systems might buffer or queue memory operations and process them later, when more resources are available. As a result, operations executed in a specific order by one processor can become visible to other processors in a different order. Back to our example, let's consider two scenarios:
  - On thread2's CPU, write 1 is buffered while write 2 is completed. The main thread, running on a different CPU, then observes that ready is 1 while data is still 0 (its default value), and it prints 0.
  - Both write 1 and write 2 are completed. The main thread then fetches the fresh value of ready but uses the stale value of data from its cache (that cached copy was stored by read 1). Thus, it prints the stale value, 0.
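To make the hoisting point concrete, here is roughly what the transformed loop would look like. This is an illustrative rewrite, not actual JIT output:
// The loop as written: ready is re-read on every iteration.
while (ready == 0) { } // read 2

// What the JIT is allowed to turn it into:
int r = ready;     // ready is read only once...
while (r == 0) { } // ...so the condition can never change, and the loop spins forever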
3. Back to Safety
Fortunately, memory operation optimizations can be suppressed when needed.
Memory barriers (also called memory fences) are the lowest-level primitives for controlling memory operation reordering. A full memory barrier guarantees that memory operations issued before the barrier are executed before the operations issued after it. On a multiprocessor system, a full memory barrier causes the processor executing it to complete all buffered writes and discard or refresh all stale cached values before going on to the next operation. C# supports two kinds of barrier: full barriers (via Interlocked.MemoryBarrier) and half barriers (via the volatile keyword and the Volatile class). However, as memory barriers are very tricky and hard to use correctly (especially the half ones), we mere mortals should generally stay away from them. If you'd like to dig deeper into memory barriers, there are some detailed articles on the Internet.
Note: Traditionally, some common patterns were implemented with volatile fields, such as lazy initialization and polling for a cancellation flag. Since .NET Framework 4.0, you can use Lazy&lt;T&gt;, LazyInitializer, and CancellationToken instead.
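For instance, polling for cancellation with CancellationToken instead of a hand-rolled volatile flag might look like this minimal sketch (the sleep duration and the work placeholder are arbitrary):
using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        var cts = new CancellationTokenSource();

        var worker = Task.Run(() =>
        {
            // The token takes care of the visibility concerns that a
            // volatile bool flag was traditionally used for.
            while (!cts.Token.IsCancellationRequested)
            {
                // ... do a unit of work ...
            }
        });

        Thread.Sleep(100); // let the worker run briefly
        cts.Cancel();      // request cancellation; the worker will observe it
        worker.Wait();
        Console.WriteLine("Worker stopped.");
    }
}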
Every Interlocked operation, in addition to being atomic, also has the effect of a full memory barrier. The broken example above could be fixed using Interlocked as follows:
var thread2 = new Thread(() =>
{
    data = 1;                           // write 1
    Interlocked.Exchange(ref ready, 1); // write 2
});
thread2.Start();

int observed;
do
{
    // CompareExchange(ref ready, 0, 0) atomically reads ready (it replaces
    // 0 with 0, changing nothing) and acts as a full memory barrier.
    observed = Interlocked.CompareExchange(ref ready, 0, 0); // read 2
} while (observed == 0);
Console.Write(data); // read 3
The Interlocked operations guarantee that write 1 completes before write 2, and that read 2 completes before read 3; thus, the program never prints 0. They also guarantee that read 2 obtains the fresh value of ready as soon as thread2 sets it to 1. As a result, the problem is fixed.
Note: If you declare ready as volatile, the program will also never print 0. However, it is still possible that thread2 has already set ready to 1 while the main thread keeps using a stale cached value of ready (in other words, write 2 gets reordered with some executions of read 2). This does not break our specific program, as the main thread polls ready repeatedly and it should become 1 eventually (hardware buffers and queues get processed sooner or later). However, it might be a problem in a more complicated application. That is why you should be very careful with, and better avoid, memory barriers and volatile fields altogether unless you are a real expert in that field.
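For reference, the volatile variant of the flag would simply be a field declaration; the semantics described in the note above come with it automatically (shown here only for illustration, not as a recommendation):
// A volatile field gives half barriers: a volatile write cannot be
// reordered with operations issued before it, and a volatile read
// cannot be reordered with operations issued after it. It does not
// prevent write 2 from being observed late by the polling loop.
volatile int ready;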
The lock statement and the other higher-level concurrency primitives provided by the .NET Framework use memory barriers and/or Interlocked operations internally when necessary. Therefore, you don't need to worry about memory operation optimizations when using them correctly.
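For instance, our example could also be fixed with a higher-level primitive. Here is a minimal sketch using ManualResetEventSlim, one reasonable choice among several:
using System;
using System.Threading;

class Program
{
    static int data;
    static readonly ManualResetEventSlim readyEvent = new ManualResetEventSlim();

    static void Main()
    {
        var thread2 = new Thread(() =>
        {
            data = 1;         // write 1
            readyEvent.Set(); // write 2: signal readiness; fencing is handled internally
        });
        thread2.Start();

        readyEvent.Wait();   // read 2: blocks instead of spinning on a flag
        Console.Write(data); // read 3: guaranteed to print 1
    }
}
Besides correctness, the event-based version also avoids burning a CPU core in a busy-wait loop.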
4. Recap
The article is getting quite long, so to summarize the main points:
- When it comes to memory access, atomicity is not enough because of software and hardware memory operation optimizations.
- The Interlocked class's operations are both atomic and immune to those optimizations. You can use them instead of a lock block that contains just a single simple operation. Avoid Interlocked in more complex circumstances.
- Don't use memory barriers (including volatile fields) at all unless you are a real expert in that field. Instead, use lock and other high-level concurrency primitives.
- Finally, a reminder: you don't need to worry about the issues discussed in this article unless your program is multi-threaded and two threads access a shared memory location concurrently.
Happy coding!