Sunday, February 7, 2010

ParrotTheory: Threading

Threading is one of those technologies, one of those buzzwords, that is supposed to be the future of computer programming. It's supposed to be all things to all people: improved scalability and performance, better utilization of hardware; it can even cure the blind. The idea is that threading will allow applications to scale and take advantage of the new multicore processors that companies like Intel and AMD are creating. Like any tool, threading can be a huge help for certain people and certain classes of applications. However, threading can also bring with it a large number of problems that actually hamper program performance. To understand this dichotomy, it's important to understand what threads are and how they operate.

In a very narrow sense, a thread is an execution context. On a hardware platform like x86 we can define an execution context by knowing the contents of the processor registers, the contents of the stack, and the sequence of machine code instructions the program is following. The instruction pointer (IP) register in the processor points to the current machine code instruction, and after an instruction is executed the pointer is updated to point to the next one. By taking a snapshot of a context and saving it, we can essentially "freeze" the state. We could save the context to disk, and re-load it later to continue where we left off.
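
To make the idea concrete, a saved context might be represented by a structure roughly like the following. This is only an illustrative sketch: the field names are hypothetical, and a real operating system saves considerably more state (floating-point registers, flags, page-table information, and so on).

#include <stdint.h>

/* Illustrative only: a hypothetical snapshot of an execution context. */
typedef struct {
    uint64_t registers[16];   /* general-purpose register contents          */
    uint64_t instruction_ptr; /* IP: address of the next instruction to run */
    uint64_t stack_ptr;       /* top of this thread's stack                 */
    void    *stack_pages;     /* the saved stack memory itself              */
} ExecutionContext;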

A single CPU core is a very linear device. Machine code instructions are executed in order, starting at some address and proceeding linearly through memory. The one exception, the branch, allows control to jump to an arbitrary address, where linear execution continues. When you look at a modern single-core desktop machine it appears that things are happening concurrently and multiple programs are executing at the same time. This is only a facade. The OS breaks execution into time slices in which programs execute one at a time. Moving control from one process to another is called a context switch. Each process is allocated individual slices of time, and if the slices are made small enough and context switches happen often enough, it appears to the user that everything is happening together in parallel. It's a play on the limitations of human perception.

Threads can be preemptive or cooperative. A cooperative threading system, by far the less common type, passes control to a thread until that thread explicitly relinquishes it. A cooperative system has a significant benefit: the program knows when its execution will be paused and can therefore avoid much non-deterministic behavior. Considerably more common is preemptive multithreading, where the OS controls thread context switches. The thread executes without any knowledge of other threads and without any notion of cooperation. At a seemingly random time the OS halts the thread, saves its current execution context, puts the thread into a queue, and loads the next thread to execute. This system brings the benefit that programs can be written without any knowledge of threads and still be run on multiprocessing systems. We also get the benefit that no one program can "hog" system resources: the OS makes certain that all programs get a fair opportunity to execute. On the other hand, preemptive multithreading brings a certain amount of non-deterministic behavior and creates whole classes of problems, like memory corruption and deadlocking, that do not exist otherwise.
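
As a tiny sketch of the cooperative style, here is what a cooperative worker might look like. Everything here is hypothetical scaffolding rather than a real threading API; the point is only that the thread itself decides when to hand control back to the scheduler.

/* Hypothetical cooperative worker: all of these names are made up for illustration. */
typedef struct { int items_remaining; } work_queue;

static int  has_work(work_queue* q)         { return q->items_remaining > 0; }
static void process_one_item(work_queue* q) { q->items_remaining--; }
static void yield_to_scheduler(void)        { /* hypothetical: hand control back here */ }

void cooperative_worker(work_queue* q) {
    while (has_work(q)) {
        process_one_item(q);    /* do a bounded chunk of work...          */
        yield_to_scheduler();   /* ...then explicitly relinquish the CPU  */
    }
}

Under preemptive multithreading no such yield call is needed (or can be relied on): the OS interrupts the thread wherever it happens to be.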

In a preemptive multithreading system (the most common type, and the only one I will consider here), each thread is a structure in the OS. Threads are managed by a routine called the thread scheduler, which is typically implemented as an interrupt handler attached to a hardware timer device. When the timer sends a trigger signal to the CPU, the scheduler executes. Here is a rough sketch of what a scheduler looks like:

interrupt void ThreadScheduler() {
    ExecutionContext lastctx = SaveCurrentContext();  /* freeze the thread that was running   */
    ExecutionContext nextctx = GetNextContext();      /* pick the next thread from the queue  */
    Schedule(lastctx);                                /* put the old context back in the queue */
    LoadContext(nextctx);                             /* restore its registers and resume it  */
}

When the scheduler activates a saved execution context, which consists of the contents of the processor registers and the IDs of the memory pages assigned to that thread, the necessary pages are loaded back into memory and execution continues from the point where it was last interrupted, as if nothing had happened. The current context is enqueued and the next execution context is popped off the queue. Multiple threads can be executing in a single process, in which case they share the memory pages containing executable code. However, each thread has its own stack, and likely its own heap pages as well. This is how multiple threads in a single process are differentiated.

Because of the necessary operations of saving the current context, enqueuing it, dequeuing the next context, and loading it, a threaded system is inherently slower for linear operations than a non-threaded system. Threads always carry a performance cost in terms of context switching and a memory cost in terms of maintaining separate stacks and context structures. In addition, the more threads we have, the less frequently each individual thread runs. To understand why, assume we have a system that switches threads 10 times per second. With only one thread, it runs 100% of the time. With 10 threads, each gets only one tenth of every second to operate. With 100 threads, each thread gets only a one-tenth-of-a-second opportunity to execute every 10 seconds. These costs are not necessarily large (in many systems they are negligible), but they do exist. In exchange we gain the ability to simplify and encapsulate separate tasks, create the illusion of concurrency, and (most importantly for graphical systems) limit the pauses the user experiences while the system is processing another task.
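
The arithmetic above is trivial, but here it is spelled out in a quick snippet (the 10-switches-per-second rate is just the number assumed in the example):

#include <stdio.h>

int main(void) {
    const double switches_per_second = 10.0;             /* assumed switch rate          */
    const double slice_length = 1.0 / switches_per_second;
    const int thread_counts[] = { 1, 10, 100 };

    for (int i = 0; i < 3; i++) {
        int n = thread_counts[i];
        /* with n threads, each gets one slice every n * slice_length seconds */
        printf("%3d threads: a %.1fs slice every %4.1f seconds\n",
               n, slice_length, n * slice_length);
    }
    return 0;
}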

On multiprocessor or multicore systems, we also gain the benefit that threads can run on separate processors, truly in parallel. In this way, as the number of processor cores increases, so too can the performance of an application, if it uses enough threads to fill those processors. In these situations, a program maximizes its throughput when it has as many executing threads as there are processor cores available to run them. Too many threads and we pay for costly context switches. Too few threads and processor cores lie unused.
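
As a rough sketch of the "one thread per core" idea, here is one way to do it with POSIX threads. sysconf(_SC_NPROCESSORS_ONLN) is a common (glibc/BSD, not strictly portable) way to ask how many cores are online, and the worker body here is just a placeholder.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Placeholder worker: each thread would do its share of the real work here. */
static void* worker(void* arg) {
    long id = (long)arg;
    printf("worker %ld running\n", id);
    return NULL;
}

int main(void) {
    long cores = sysconf(_SC_NPROCESSORS_ONLN);   /* number of online cores   */
    if (cores < 1) cores = 1;

    pthread_t threads[64];
    long count = cores < 64 ? cores : 64;         /* clamp to our array size  */

    for (long i = 0; i < count; i++)
        pthread_create(&threads[i], NULL, worker, (void*)i);
    for (long i = 0; i < count; i++)
        pthread_join(threads[i], NULL);
    return 0;
}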

Context switches can only happen at instruction boundaries. This means that an individual machine code instruction is atomic: a context switch can happen between machine code instructions but cannot happen in the middle of an instruction. However, beyond this guarantee there is no way for the program to determine ahead of time where in the program execution these switches will happen. This creates nondeterminism which can cause bugs.

So what kinds of bugs can be created? Let's consider a structure with two integer values, defined as a psychotic redundancy measure. The two are supposed to always be copies of each other, and if they are ever different the program will freak out and crash. Here's an access routine to change both values at once:

void modify_data(my_struct* s, int newvalue) {
    s->data1 = newvalue;    /* data1 and data2 must always hold the same value */
    s->data2 = newvalue;
}

This seems straightforward. Now consider the (admittedly contrived) case where two threads call this function with the same structure pointer, and a context switch happens between the two statements. Thread 1 updates data1 to 1234 and a context switch happens. Thread 2 updates data1 and data2 to 4567, followed by another switch. Now thread 1 updates data2 to 1234, and the two values are no longer equal. The structure is left in an inconsistent state, the program freaks out, the plane crashes, and all the orphan children die in a fire.
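
To make the interleaving concrete, here is the same scenario laid out as a timeline:

/* Thread 1 calls modify_data(s, 1234); thread 2 calls modify_data(s, 4567).
 *
 *   Thread 1:  s->data1 = 1234;
 *                                  <-- context switch
 *   Thread 2:  s->data1 = 4567;
 *   Thread 2:  s->data2 = 4567;
 *                                  <-- context switch
 *   Thread 1:  s->data2 = 1234;
 *
 * Final state: data1 == 4567, data2 == 1234. The invariant is broken even
 * though each thread, viewed in isolation, did nothing wrong.
 */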

To avoid inconsistencies like this we can introduce any number of locking primitives: mutexes, semaphores, spinlocks, critical sections, or whatever else people create for the purpose. It's beyond the scope of this post to talk about all of these individually, so I might save that discussion for another post later. Regardless of the exact method we use, the code now looks like this:

void modify_data(my_struct* s, int newvalue) {
    lock (lock_object) {       /* pseudocode: acquire the lock on entry,   */
        s->data1 = newvalue;   /* release it when the block is exited      */
        s->data2 = newvalue;
    }
}

And the two threads are somehow prevented from both entering the function at the same time. If one thread tries to enter the locked region before the other thread has left it, we force an immediate context switch so the first thread can finish and exit before the second thread continues.
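
For concreteness, here is one possible implementation of that generic lock block using a POSIX mutex. A mutex is only one choice (a spinlock or a critical section would serve the same purpose), and the my_struct definition and global data_lock are assumed here for illustration.

#include <pthread.h>

typedef struct { int data1; int data2; } my_struct;   /* the structure from the example */

static pthread_mutex_t data_lock = PTHREAD_MUTEX_INITIALIZER;

void modify_data(my_struct* s, int newvalue) {
    pthread_mutex_lock(&data_lock);    /* blocks if another thread already holds it */
    s->data1 = newvalue;
    s->data2 = newvalue;
    pthread_mutex_unlock(&data_lock);  /* let the next waiting thread proceed       */
}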

So here's another good rule about threads: In addition to the cost overheads of switching contexts and managing threads, there are also costs involved in managing access to shared resources. All these costs combined can really make threading a performance drain instead of a performance boon. Plus, the need to properly restrict access to shared resources puts a strain on programmers and can increase the number of bugs in programs. Threads are but one tool in the toolbox of a skilled programmer, and should be used with care.
