Runtime Features

In this section we describe the responsibilities and features of the SMP superscalar runtime.

Data dependency control

Tasks in a program operate on data that is generated and consumed from task to task. These relations define a certain control flow that must be respected in order to execute the tasks and obtain the same results as in their corresponding sequential execution. SMP superscalar guarantees the consistency of the results by respecting the data dependencies between tasks. Dependency information is generated and kept in a task dependency graph at run time.

Task dependencies are calculated by analysing the direction (input, output or both), length and address of each parameter against those of previous tasks in sequential order. There are three kinds of data dependencies. Read after Write dependencies (RaW) are those between a task that reads data and the task that has written it. Write after Write (WaW) dependencies are those between a task that writes to a data location and a task that has previously written to it. Finally, write after read (WaR) dependencies are those between a task that writes to data and another that read it earlier.

The runtime is structured in a main thread that runs the non task code and populates the graph and a series of worker threads that consume and execute the tasks from the graph. As tasks are executed, their output parameters becomes available to the tasks that were dependant on them.

Data dependency reduction

Dependencies are one of the factors that determine how much parallelism can be extracted out of an application. Superscalar processors try to reduce dependencies between instructions by performing register renaming. This technique is also used by SMP superscalar at run time in order to reduce task dependencies and increase the parallelism.

The renaming technique consists in storing temporary definitions of a program variable into temporary storage. That is, if a task writes to an array, renaming can replace that array by a temporary one and redirect all following reads of that definition to the temporary array. This effectively eliminates all WaR and WaW dependencies.

One clear disadvantage of renaming is that it increases the amount of memory usage. In order to limit the effects on memory usage, whenever a parameter is used as both input and output in a task and its input definition is not used by any other task, we do not rename it and instead allow its output definition to reside in the same memory location as its input definition.

This enhancement reduces memory usage considerably in applications that perform many accumulations on parameters. This is the case of the dense blocked matrix multiplication, in which this technique avoids renaming altogether and allows all calculations to be performed in place.

Workload distribution

One of the goals of SMP superscalar is to provide good performance. In these sense, the scheduling algorithm is designed according to three principles. First, trying to maximise the paralleelism. Second, trying to make task execution fast on the processors. And third, do not exceed the benefits of a simpler scheduling algorithm by applying a more computationally expensive one.

Our algorithm exploits data locality by taking advantage of the information in the graph. Since one task in the graph can only depend on another if the first consumes data generated by the second, we take advantage of that relation and try to run them in the same thread following a pseudo depth first traversal order.

Each thread has a ready task list associated to it. The main thread is responsible of running the main program by going through the non task user code, analysing the data dependencies and adding the tasks to the graph. New tasks that have no input dependencies are added to the main thread ready list where they can be picked up by any thread. Whenever the main thread has to wait for tasks to finish, it contributes to advancing the execution by executing tasks in the same way as the worker threads do.

Worker threads look for ready tasks first in their own list, then on the main thread ready list and then on the other thread lists. When a thread finishes running a task, it pushes all the task successors that have become ready into its ready list. While worker threads consume tasks from their own list in LIFO order, they steal them from other threads in FIFO order. That is, they consume the graph in a depth first order as long as they can can get ready tasks, and then steal tasks from other threads in a breadth first order when their ready lists become empty.

The idea behind this design is that each thread will be executing tasks in a different region of the graph and have little interference with other threads as long as there are ready tasks in that region or there are unexplored zones in the graph. Otherwise they will steal work from other threads in a way that tries to minimise the effect on the cache locality of that thread.

The following figure shows the task dependency graph of an execution of the code shown previously with N=2. The shape and shade of each graph node indicates which CPU has it been executed on.




Tracing

Measuring is key to improving the performance. For this reason, SMP superscalar has an integrated tracing facility. This facility can be activated by recompiling the applications with the -t flag. Applications compiled with tracing enabled record a series of events during their execution. These events are dumped into a Paraver formatted trace at the end of the execution.

The events in the traces allow to analyse what tasks have been executing at any moment (their type and invokation order), what is doing every thread at any given moment, what is the distribution of the duration of each task, etc. The following figure shows the task types of an execution of the code shown in the previous sections. The block_macc task is show in blue, while the block_acc task is shown in white.






The following figure shows in different colors the task numbers. Greener shades correspond to lower numbered tasks and blue shaded correspond to higher numbered tasks.







Prev / Download SMPSs / Documents