TSM - Introduction to .Net Multithreading Programming (II)

Dan Sabadis - Team Lead @ SDL

In the previous article we laid out the foundations of multithreading and described the pillar of asynchronous programming in .Net, namely the IAsyncResult interface. We also provided a brief history of the evolution of .Net's multithreading abstractions, starting with the Thread class and ending with the async-await pattern (and keywords).

In this article we will elaborate on the topic with detailed .Net thread synchronization examples, focusing on the two modes in which the CPU spends its time in any modern operating system: user mode and kernel mode. As usual, we start with the definitions.

Kernel Mode

In Kernel mode, the executing code has complete and unrestricted access to the underlying hardware. It can execute any CPU instruction and reference any memory address. Kernel mode is generally reserved for the lowest-level, most trusted functions of the operating system. Crashes in Kernel mode are catastrophic; they will halt the entire computer.

User Mode

In User mode, the executing code has no ability to directly access hardware or reference arbitrary memory. Code running in User mode must delegate to system APIs to access hardware or privileged memory. Due to the protection afforded by this sort of isolation, crashes in User mode are always recoverable. Most of the code running on a PC executes in User mode.

As a first conclusion, whenever possible, a programmer should use User-mode synchronization constructs as they are significantly faster than the Kernel-mode constructs because they use special CPU instructions to coordinate threads. This means that the coordination occurs in hardware, which is what makes it fast. This also means that the Windows operating system never detects that a thread is blocked on a User-mode synchronization construct! Because a thread blocked on a User-mode construct is never considered blocked, the thread pool will not create a new thread to replace the temporarily blocked thread. In addition, these CPU instructions block the thread for an incredibly short period of time.

As usual in life, there is a trade-off, and the User-mode constructs have a downside: a thread that wants to acquire a resource but can't get it will spin in User mode. This potentially wastes a lot of CPU time, which would be better spent performing other work, or even just letting the CPU go idle to conserve power.

The Kernel-mode constructs are provided by the operating system itself. As such, they require that the application's threads call functions implemented in the operating system kernel. Having threads transition from user mode to kernel mode and back incurs a big performance hit, which is why Kernel-mode constructs should be avoided. However, they do have a positive feature—when a thread uses a Kernel-mode construct to acquire a resource that another thread has, the OS blocks the thread, usually putting it in sleep-mode, so that it is no longer wasting CPU time.

A thread waiting on a construct might be blocked forever if the thread currently holding the construct never releases it. If the construct is a User-mode construct, the thread is running on a CPU forever, and we call this a livelock. If the construct is a Kernel-mode construct, the thread is blocked forever, and we call this a deadlock. Both of these are bad, but of the two, a deadlock is always preferable to a livelock, because a livelock wastes both CPU time and memory, whereas a deadlock wastes only memory.

In an ideal world, we'd like constructs that take the best of both worlds: a synchronization construct that is fast and non-blocking (like the User-mode constructs) when there is no contention, but that blocks the thread via the operating system kernel when there is contention. Hybrid constructs that work like this do exist in the .Net framework and will be covered in the examples below.

There are two kinds of primitive user-mode thread synchronization constructs:

- Volatile constructs, which perform an atomic read or an atomic write on a variable;
- Interlocked constructs, which perform an atomic read and write on a variable.

All the volatile and interlocked constructs require you to pass a reference (memory address) to a variable containing a simple data type.

Enough with the pure theory; from now on, let's start looking at some code samples:

internal sealed class ThreadsSharingData {
  private Int32 m_flag = 0;
  private Int32 m_value = 0;

  // This method is executed by one thread
  public void Thread1() {
    // Note: These could execute in reverse order
    m_value = 5;
    m_flag = 1;
  }

  // This method is executed by another thread
  public void Thread2() {
    // Note: m_value could be read before m_flag
    if (m_flag == 1)
      Console.WriteLine(m_value);
  }
}

The problem with this code is that the compilers/CPU could rearrange the instructions in such a way as to reverse the two lines of code in the Thread1 method. After all, reversing the two lines does not change the intention of the method: the method needs to get a 5 into m_value and a 1 into m_flag. These kinds of problems arise during the compiler's optimization phase, so the issue will typically not even be reproducible in a DEBUG build, but only in a RELEASE build of a program. That means the output of this program is not 100% predictable: it will be 5 in the great majority of cases, but from time to time it could also be 0 in production.

In order to correct the problem with the code above, we will use the .Net static class Volatile (or alternatively the volatile keyword).

public static class Volatile {
  public static void Write(ref Int32 location, Int32 value);
  public static Int32 Read(ref Int32 location);
}

These methods are special. In effect, they disable some optimizations usually performed by the C# compiler, the JIT compiler, and the CPU itself. Here's how the methods work:

- Volatile.Write forces the value in location to be written at the point of the call; in addition, any earlier program-order loads and stores must occur before the call to Volatile.Write.
- Volatile.Read forces the value in location to be read at the point of the call; in addition, any later program-order loads and stores must occur after the call to Volatile.Read.

Therefore, we can now fix the ThreadsSharingData class by using these methods:

internal sealed class ThreadsSharingData {
  private Int32 m_flag = 0;
  private Int32 m_value = 0;

  // This method is executed by one thread
  public void Thread1() {
    // Note: 5 must be written to m_value before 1
    // is written to m_flag
    m_value = 5;
    Volatile.Write(ref m_flag, 1);
  }

  // This method is executed by another thread
  public void Thread2() {
    // Note: m_value must be read after m_flag is read
    if (Volatile.Read(ref m_flag) == 1)
      Console.WriteLine(m_value);
  }
}

If the code above is slightly confusing, it can be summarized as a simple rule:

When threads are communicating with each other via shared memory, write the last value by calling Volatile.Write and read the first value by calling Volatile.Read.

Using the volatile keyword, we can rewrite and simplify the ThreadsSharingData class as follows:

internal sealed class ThreadsSharingData {
  private volatile Int32 m_flag = 0;
  private Int32 m_value = 0;

  // This method is executed by one thread
  public void Thread1() {
    // Note: 5 must be written to m_value before 1
    // is written to m_flag
    m_value = 5;
    m_flag = 1;
  }

  // This method is executed by another thread
  public void Thread2() {
    // Note: m_value must be read after m_flag is read
    if (m_flag == 1)
      Console.WriteLine(m_value);
  }
}

Volatile's Read method performs an atomic read operation, and its Write method performs an atomic write operation; that is, each method performs either an atomic read or an atomic write, but not both. Let's look next at the methods of the static System.Threading.Interlocked class.

Each of the methods in the Interlocked class performs an atomic read and write operation. In addition, all the Interlocked methods are full memory fences. That is, any variable writes before the call to an Interlocked method execute before the Interlocked method, and any variable reads after the call execute after the call. This means the compiler does not have full liberty to rearrange the order of read/write operations on the variables that are passed as parameters to the Interlocked methods.

Because the static methods that operate on Int32 variables are by far the most commonly used, a few of them are shown below. There are also overloads that operate on Int64 values. Furthermore, the Interlocked class offers Exchange and CompareExchange methods that take Object, IntPtr, Single, and Double, as well as a generic version in which the generic type is constrained to class (any reference type):

public static class Interlocked {
  // return (++location)
  public static Int32 Increment(ref Int32 location);

  // return (--location)
  public static Int32 Decrement(ref Int32 location);

  // return (location += value)
  // Note: value can be a negative number, allowing subtraction
  public static Int32 Add(ref Int32 location, Int32 value);

  // Int32 old = location; location = value; return old;
  public static Int32 Exchange(ref Int32 location, Int32 value);

  // Int32 old = location;
  // if (location == comparand) location = value;
  // return old;
  public static Int32 CompareExchange(ref Int32 location,
    Int32 value, Int32 comparand);
  ...
}
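To get a feel for how CompareExchange is typically used in practice, here is a minimal lock-free "maximum" sketch (the InterlockedEx class and its Max method are our own names, for illustration only): read the current value, compute the desired value, then publish it with CompareExchange only if no other thread changed the variable in the meantime, retrying otherwise.

using System;
using System.Threading;

internal static class InterlockedEx {
  // Atomically replaces target with the larger of its current
  // value and value, using the read-compute-publish retry loop
  public static Int32 Max(ref Int32 target, Int32 value) {
    Int32 currentVal = target, startVal, desiredVal;
    do {
      startVal = currentVal;
      desiredVal = Math.Max(startVal, value);
      // Publish desiredVal only if target still holds startVal;
      // CompareExchange returns the value target had before the call
      currentVal = Interlocked.CompareExchange(ref target, desiredVal, startVal);
    } while (startVal != currentVal); // another thread raced us; retry
    return desiredVal;
  }
}

This read-compute-publish-retry loop is the backbone of most lock-free algorithms built on top of CompareExchange.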

Great as they are, the Interlocked methods mostly operate on Int32 values. What if one needs to manipulate a bunch of fields in a class object atomically? In this case, we need a way to stop all threads but one from entering the region of code that manipulates the fields. Using Interlocked methods, we can build a thread synchronization lock:

internal struct SimpleSpinLock {
  private Int32 m_ResourceInUse; // 0=false (default), 1=true

  public void Enter() {
    while (true) {
      // Always set resource to in-use;
      // when this thread changes it from not-in-use, return
      if (Interlocked.Exchange(ref m_ResourceInUse, 1) == 0)
        return;

      // Black magic goes here...
    }
  }

  public void Leave() {
    // Set resource to not-in-use
    Volatile.Write(ref m_ResourceInUse, 0);
  }
}

And here is a class that shows how to use the SimpleSpinLock.

public sealed class SomeResource {
  private SimpleSpinLock m_sl = new SimpleSpinLock();

  public void AccessResource() {
    m_sl.Enter();
    // Only one thread at a time can get in here
    // to access the resource...
    m_sl.Leave();
  }
}

The SimpleSpinLock implementation is very simple. If two threads call Enter at the same time, Interlocked.Exchange ensures that exactly one of them changes m_ResourceInUse from 0 to 1 and sees that m_ResourceInUse was 0. This thread then returns from Enter and continues executing the code in the AccessResource method. The other thread changes m_ResourceInUse from a 1 to a 1; it sees that it did not change the value from a 0, so it starts spinning, continuously calling Exchange, until the first thread calls Leave!

The big potential problem with this lock is that it causes threads to spin when there is contention for the lock. This spinning wastes precious CPU time, preventing the CPU from doing other, more useful work. As a result, spin locks should only ever be used to guard regions of code that execute very quickly.
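As a hint at what could replace the "Black magic goes here..." comment in SimpleSpinLock, production spin locks spin "politely" rather than hammering Exchange in a tight loop. A possible variant of Enter (our own sketch, not part of the original class) uses System.Threading.SpinWait, which spins a few times and then starts yielding the CPU to other threads:

public void Enter() {
  SpinWait spinner = new SpinWait(); // a struct; no allocation
  while (true) {
    // Try to change the flag from 0 (free) to 1 (in use)
    if (Interlocked.Exchange(ref m_ResourceInUse, 1) == 0)
      return;

    // The "black magic": back off a little so the lock owner
    // (and other threads) can make progress
    spinner.SpinOnce();
  }
}

SpinWait spins briefly and then starts yielding the processor, so a long wait degrades gracefully instead of burning a full CPU core.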

The Kernel-mode constructs are much slower than the User-mode constructs. This is because they require coordination from the operating system itself. Moreover, each method call on a Kernel object causes the calling thread to transition from managed code to native User-mode code to native Kernel-mode code, and then return all the way back. These transitions require a lot of CPU time and, if performed frequently, can adversely affect the overall performance of your application.

However, the Kernel-mode constructs offer some benefits over the primitive User-mode constructs, such as:

- When a Kernel-mode construct is unavailable, the waiting thread is blocked by the kernel instead of spinning, so it wastes no CPU time;
- A thread can wait on a Kernel-mode construct with a timeout;
- Threads in different processes, as well as native and managed threads, can synchronize with one another via Kernel-mode constructs.

The System.Threading namespace offers an abstract base class called WaitHandle, a simple class whose sole purpose is to wrap a Windows kernel object handle. The .Net Framework provides several classes derived from WaitHandle, all defined in the System.Threading namespace. The class hierarchy looks like this:

WaitHandle
  EventWaitHandle
    AutoResetEvent
    ManualResetEvent
  Semaphore
  Mutex

There are a few things to note about the Wait-related methods of the WaitHandle class:

- The WaitOne instance method blocks the calling thread until the underlying kernel object becomes signaled; the overloads that accept a timeout return true if the object was signaled and false if the wait timed out;
- The static WaitAny and WaitAll methods let a thread block until any one, or all, of a set of kernel objects become signaled (WaitAny returns the index of the object that became signaled).

Events are simply Boolean variables maintained by the Kernel. A thread waiting on an event blocks when the event is false and unblocks when the event is true. There are two kinds of events. When an auto-reset event is true, it wakes up just one blocked thread, because the kernel automatically resets the event back to false after unblocking the first thread. When a manual-reset event is true, it unblocks all threads waiting for it because the kernel does not automatically reset the event back to false; your code must manually reset the event back to false. The classes related to events look like this:

public class EventWaitHandle : WaitHandle {
  public Boolean Set();   // Sets Boolean to true; always returns true
  public Boolean Reset(); // Sets Boolean to false; always returns true
}

public sealed class AutoResetEvent : EventWaitHandle {
  public AutoResetEvent(Boolean initialState);
}

public sealed class ManualResetEvent : EventWaitHandle {
  public ManualResetEvent(Boolean initialState);
}

As a metaphor, the difference between these classes is like the difference between a tollbooth and a door. The ManualResetEvent is the door, which needs to be closed (reset) manually. The AutoResetEvent is the tollbooth, allowing one car to go by and automatically closing before the next one can get through. Or just imagine that the AutoResetEvent executes WaitOne() and Reset() as a single atomic operation.
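A small console sketch (hypothetical names; the Sleep calls are just a crude way to sequence the demo) makes the metaphor concrete: one Set on the ManualResetEvent releases all waiting threads, whereas one Set on the AutoResetEvent releases exactly one.

using System;
using System.Threading;

public static class EventDemo {
  public static void Main() {
    ManualResetEvent door = new ManualResetEvent(false);  // the "door"
    AutoResetEvent tollbooth = new AutoResetEvent(false); // the "tollbooth"

    for (Int32 i = 0; i < 3; i++) {
      new Thread(() => { door.WaitOne(); Console.WriteLine("through the door"); }) { IsBackground = true }.Start();
      new Thread(() => { tollbooth.WaitOne(); Console.WriteLine("through the tollbooth"); }) { IsBackground = true }.Start();
    }

    Thread.Sleep(100); // give all six threads time to block
    door.Set();        // releases ALL three "door" threads
    tollbooth.Set();   // releases exactly ONE "tollbooth" thread
    Thread.Sleep(100); // the other two tollbooth threads stay blocked
  }
}

The expected output contains three "through the door" lines but only one "through the tollbooth" line.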

Using an auto-reset event, we can easily create a thread synchronization lock whose behavior is similar to the SimpleSpinLock class we displayed earlier:

internal sealed class SimpleWaitLock : IDisposable {
  private readonly AutoResetEvent m_available;

  public SimpleWaitLock() {
    m_available = new AutoResetEvent(true); // Initially free
  }

  public void Enter() {
    // Block in kernel until resource available
    m_available.WaitOne();
  }

  public void Leave() {
    // Let another thread access the resource
    m_available.Set();
  }

  public void Dispose() { m_available.Dispose(); }
}

We would use this SimpleWaitLock exactly the same way we used the SimpleSpinLock. In fact, the external behavior is exactly the same; however, the performance of the two locks is radically different. When there is no contention on the lock, the SimpleWaitLock is much slower than the SimpleSpinLock, because every call to SimpleWaitLock's Enter and Leave methods forces the calling thread to transition from managed code to the kernel and back, which is bad. But when there is contention, the losing thread is blocked by the kernel and is not spinning and wasting CPU cycles, which is good. Note also that constructing the AutoResetEvent object and calling Dispose on it cause managed-to-kernel transitions too, affecting performance negatively. These calls usually happen rarely, so they are not something to be too concerned about.
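For completeness, here is a usage sketch mirroring the earlier SomeResource class (the class name is ours); the only visible differences are that the owner should eventually Dispose the lock, and that a try/finally guarantees Leave runs even if the guarded code throws:

public sealed class SomeOtherResource : IDisposable {
  private readonly SimpleWaitLock m_swl = new SimpleWaitLock();

  public void AccessResource() {
    m_swl.Enter();
    try {
      // Only one thread at a time can get in here
      // to access the resource...
    }
    finally {
      m_swl.Leave(); // release the lock even on exception
    }
  }

  public void Dispose() { m_swl.Dispose(); }
}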

To give a better feel for the performance differences, let's look at the following code.

public static void Main() {
  Int32 x = 0;
  const Int32 iterations = 10000000; // 10 million

  // How long does it take to increment x 10 million times?
  Stopwatch sw = Stopwatch.StartNew();
  for (Int32 i = 0; i < iterations; i++) {
    x++;
  }
  Console.WriteLine("Incrementing x: {0:N0}", sw.ElapsedMilliseconds);

  // How long does it take to increment x 10 million times,
  // adding the overhead of calling a method that does nothing?
  sw.Restart();
  for (Int32 i = 0; i < iterations; i++) {
    M(); x++; M();
  }
  Console.WriteLine("Incrementing x in M: {0:N0}", sw.ElapsedMilliseconds);

  // How long does it take to increment x 10 million times,
  // adding the overhead of an uncontended SpinLock?
  SpinLock sl = new SpinLock(false);
  sw.Restart();
  for (Int32 i = 0; i < iterations; i++) {
    Boolean taken = false;
    sl.Enter(ref taken); x++; sl.Exit();
  }
  Console.WriteLine("Incrementing x in SpinLock: {0:N0}", sw.ElapsedMilliseconds);

  // How long does it take to increment x 10 million times,
  // adding the overhead of an uncontended SimpleWaitLock?
  using (SimpleWaitLock swl = new SimpleWaitLock()) {
    sw.Restart();
    for (Int32 i = 0; i < iterations; i++) {
      swl.Enter(); x++; swl.Leave();
    }
    Console.WriteLine("Incrementing x in SimpleWaitLock: {0:N0}", sw.ElapsedMilliseconds);
  }
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static void M() { /* This method does nothing but return */ }

When we run the preceding code, we get the following output.

Incrementing x: 8                        Fastest
Incrementing x in M: 69                  ~9x slower
Incrementing x in SpinLock: 164          ~21x slower
Incrementing x in SimpleWaitLock: 8,854  ~1,107x slower

As one can clearly see, just incrementing x took only 8 milliseconds. To call empty methods before and after incrementing x made the operation take nine times longer! Then, executing code in a method that uses a user-mode construct caused the code to run 21 (164 / 8) times slower. But now, see how much slower the program ran using a Kernel-mode construct: 1,107 (8,854 / 8) times slower.

So, if we can avoid thread synchronization, we should! If we do need thread synchronization, we should prefer the User-mode constructs and always try to avoid the Kernel-mode constructs.

The following example presents the hybrid lock approach, which borrows the best of both worlds: at the beginning we try the User-mode synchronization, and if it fails (another thread has already acquired the lock), we fall back to the safer, non-spinning Kernel-mode lock:

internal sealed class SimpleHybridLock : IDisposable {
  // The Int32 is used by the primitive user-mode
  // constructs (Interlocked methods)
  private Int32 m_waiters = 0;

  // The AutoResetEvent is the primitive kernel-mode construct
  private readonly AutoResetEvent m_waiterLock = new AutoResetEvent(false);

  public void Enter() {
    // Indicate that this thread wants the lock
    if (Interlocked.Increment(ref m_waiters) == 1)
      return; // Lock was free, no contention, just return

    // Another thread has the lock (contention),
    // make this thread wait
    m_waiterLock.WaitOne(); // Bad performance hit here
    // When WaitOne returns, this thread now has the lock
  }

  public void Leave() {
    // This thread is releasing the lock
    if (Interlocked.Decrement(ref m_waiters) == 0)
      return; // No other threads are waiting, just return

    // Other threads are waiting, wake 1 of them
    m_waiterLock.Set(); // Bad performance hit here
  }

  public void Dispose() { m_waiterLock.Dispose(); }
}

The SimpleHybridLock contains two fields: an Int32, which will be manipulated via the primitive User-mode constructs, and an AutoResetEvent, which is a primitive Kernel-mode construct.

To get great performance, the lock tries to use the Int32 and to avoid using the AutoResetEvent as much as possible. Just constructing a SimpleHybridLock object causes the AutoResetEvent to be created, and this is a massive performance hit compared to the overhead associated with the Int32 field. Later in this article we will mention the "Slim" hybrid constructs (such as ManualResetEventSlim), which avoid the performance hit of creating the kernel object until the first time contention is detected from multiple threads accessing the lock at the same time. The Dispose method closes the AutoResetEvent, and this is also a big performance hit.

Although it would be nice to improve the performance of constructing and disposing a SimpleHybridLock object, it would be better to focus on the performance of its Enter and Leave methods, because these methods tend to be called many, many times over the object's lifetime. Let's focus on these methods.

The first thread that calls Enter causes Interlocked.Increment to add 1 to the m_waiters field, making its value 1. This thread sees that there were zero threads waiting for this lock, so it gets to return from its call to Enter. The thing to appreciate here is that the thread acquired the lock very quickly. Now, if another thread comes along and calls Enter, this second thread increments m_waiters to 2 and sees that another thread has the lock, so it blocks by calling WaitOne on the AutoResetEvent. Calling WaitOne causes the thread to transition into the operating system's kernel, and this is a big performance hit. However, the thread must stop running anyway, so it is not too bad to have it spend some time stopping completely. The good news is that the thread is now blocked, so it is not wasting CPU time by spinning, which is what the SimpleSpinLock's Enter method, introduced earlier, does.

Let's look at the Leave method. When a thread calls Leave, Interlocked.Decrement subtracts 1 from the m_waiters field. If m_waiters is now 0, then no other threads are blocked inside a call to Enter, and the thread calling Leave can simply return. Again, think about how fast this is: leaving a lock means that a thread subtracts 1 from an Int32, performs a quick 'if' test, and then returns! On the other hand, if the thread calling Leave sees that m_waiters was not 1, then it knows that there is contention and that at least one other thread is blocked in the kernel. This thread must wake up one (and only one) of the blocked threads, which it does by calling Set on the AutoResetEvent. This is a performance hit, because the thread must transition into the kernel and back, but this transition occurs only when there was contention. Of course, the AutoResetEvent ensures that only one blocked thread wakes up; any other threads blocked on it will continue to block until the newly unblocked thread eventually calls Leave.

In conclusion, an application's overall performance can be improved by having a thread spin in user mode for a little while before having the thread transition to kernel mode. If the lock that the thread is waiting for becomes available while spinning, the transition to Kernel mode is avoided. If one includes the SimpleHybridLock in the performance comparison above, it obtains virtually the same timing as the SpinLock class.
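A hedged sketch of that idea, as a variant of SimpleHybridLock's Enter (the spin count of 500 is arbitrary; Leave stays exactly as before): first spin in user mode trying to grab the free lock via CompareExchange, and only then register as a waiter and block in the kernel.

public void Enter() {
  // Phase 1: spin briefly in user mode, hoping the lock frees up
  SpinWait spinner = new SpinWait();
  for (Int32 i = 0; i < 500; i++) { // arbitrary spin count
    // Try to change m_waiters from 0 (lock free) to 1 (lock owned)
    if (Interlocked.CompareExchange(ref m_waiters, 1, 0) == 0)
      return; // acquired the lock while spinning; no kernel transition
    spinner.SpinOnce();
  }

  // Phase 2: spinning did not help; wait like SimpleHybridLock does
  if (Interlocked.Increment(ref m_waiters) == 1)
    return;               // the lock was just released; we own it now
  m_waiterLock.WaitOne(); // contention: block in the kernel
}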

The .Net framework provides quite a few such hybrid synchronization constructs (classes), the so-called "Slim" classes, such as ManualResetEventSlim, SemaphoreSlim, and ReaderWriterLockSlim.
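As a quick illustration of one of them, a SemaphoreSlim constructed with a count of 1 can serve as a hybrid mutual-exclusion lock (a minimal sketch; the class name is ours): its Wait method spins briefly under contention before blocking the thread without wasting CPU.

using System.Threading;

public sealed class SlimGuardedResource {
  // initialCount: 1, maxCount: 1 => behaves as a mutual-exclusion lock
  private readonly SemaphoreSlim m_lock = new SemaphoreSlim(1, 1);

  public void AccessResource() {
    m_lock.Wait(); // spins briefly under contention, then blocks
    try {
      // Only one thread at a time executes here...
    }
    finally {
      m_lock.Release();
    }
  }
}

Unlike the raw kernel constructs, SemaphoreSlim also offers WaitAsync, which fits naturally with the async-await pattern mentioned in the previous article.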

Finally, we want to present the most-used hybrid thread synchronization construct: the Monitor class. This class provides a mutually exclusive lock supporting spinning, thread ownership, and recursion. It is the most-used construct because it has been around the longest, C# has a built-in keyword (lock) to support it, the just-in-time (JIT) compiler has built-in knowledge of it, and the common language runtime (CLR) itself uses it on your application's behalf. However, there are many problems with this construct, making it easy to produce buggy code, as we will see.

Every object on the heap can have a data structure, called a sync block, associated with it. A sync block has fields for a Kernel object, the owning thread's ID, a recursion count, and a waiting threads count. The Monitor class is a static class whose methods accept a reference to any heap object, and these methods manipulate the fields in the specified object's sync block. Here is what the most commonly used methods of the Monitor class look like.

public static class Monitor {
  public static void Enter(Object obj);
  public static void Exit(Object obj);

  // We can also specify a timeout when entering
  // the lock (not commonly used):
  public static Boolean TryEnter(Object obj, Int32 millisecondsTimeout);

  // We'll discuss the lockTaken argument below
  public static void Enter(Object obj, ref Boolean lockTaken);
  public static void TryEnter(Object obj, Int32 millisecondsTimeout, ref Boolean lockTaken);
}

Associating a sync block data structure with every object in the heap would be quite wasteful, especially because most objects' sync blocks are never used. To reduce memory usage, the CLR team uses a more efficient way to offer the functionality just described: when the CLR initializes, it allocates an array of sync blocks in the native heap. Whenever an object is created on the heap, it gets two additional overhead fields associated with it. The first overhead field, the type object pointer, contains the memory address of the type's type object. The second overhead field, the sync block index, contains an integer index into the array of sync blocks.

When an object is constructed, the object's sync block index is initialized to -1, which indicates that it doesn't refer to any sync block. Then, when Monitor.Enter is called, the CLR finds a free sync block in the array and sets the object's sync block index to refer to the sync block that was found. In other words, sync blocks are associated with an object on the fly. When Exit is called, it checks to see whether there are any more threads waiting to use the object's sync block. If there are no threads waiting for it, the sync block is free, Exit sets the object's sync block index back to -1, and the free sync block can be associated with another object in the future.

Suppose that we write a method like this:

private void SomeMethod() {
  lock (this) {
    // This code has exclusive access to the data...
  }
}
The code above is equivalent to the following; the compiler generates code like this:

private void SomeMethod() {
  Boolean lockTaken = false;
  try {
    // An exception (such as ThreadAbortException)
    // could occur here...
    Monitor.Enter(this, ref lockTaken);

    // This code has exclusive access to the data...
  }
  finally {
    if (lockTaken) Monitor.Exit(this);
  }
}

The conditional exit from the finally block ensures the lock is released only if it was actually taken (i.e., no unexpected error/exception was raised before the Enter call completed).
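One of the problems mentioned earlier is visible in this very example: lock (this) takes the lock on an object that is reachable from outside the class, so unrelated code that also locks the same instance can contend with, or even deadlock against, SomeMethod. A common defensive pattern (our recommendation, a sketch rather than part of the original sample) is to lock a private field instead:

private readonly Object m_lock = new Object(); // nobody outside the class can lock this

private void SomeMethod() {
  lock (m_lock) {
    // This code has exclusive access to the data...
  }
}

The same advice applies to locking on strings or Type objects, which can be shared across unrelated pieces of code.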

Useful links