SMPSs Programming Model
In this section we discuss the SMP superscalar programming model and syntax.
Task Based Programming
SMP superscalar is a programming environment for parallel applications based on function-level parallelism. In this model, the programmer selects a set of functions, called tasks, that can run in parallel. The runtime treats these functions as the unit of parallel computation.
Tasks are defined with a pragma annotation right before their function definition. This annotation indicates that the following function is a task and specifies the directionality of each of the task parameters. This is the syntax of a task:
Each parameter in the task construct/annotation that is an array must have its dimensions specified either in the function definition or in the construct.
The following code shows a task definition that operates on matrix blocks. It takes the first two parameters and adds their product to the third parameter.
#pragma css task input(A, B) inout(C)
void block_macc(double A[N][N], double B[N][N], double C[N][N]) {
    int i, j, k;

    for (i=0; i < N; i++)
        for (j=0; j < N; j++)
            for (k=0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}
By combining the addresses and directionality of each parameter with those of previous task invocations, the runtime is capable of analysing the dependencies at run time. This is a major difference between our work and other programming models like OpenMP workqueue extensions. It removes the effort of analysing the data dependencies from the programmer and moves it into the supporting runtime library. Moreover, the runtime is aware of the data dependencies with enough detail that it can take advantage of them instead of being limited by them.
The following code invokes the block_macc task shown previously and a new block_acc task that accumulates the values from the first matrix block onto the second. While this code is simple, it has some data dependencies that would have required manual handling by the programmer under other programming models.
int i, j, k;

for (i=0; i < N; i++)
    for (j=0; j < N; j++)
        for (k=0; k < N; k++)
            block_macc(bigA[i][k], bigB[k][j], bigC[i][j]);

for (i=0; i < N; i++)
    for (j=0; j < N; j++)
        for (k=0; k < N; k++)
            block_macc(bigD[i][k], bigE[k][j], bigF[i][j]);

for (i=0; i < N; i++)
    for (j=0; j < N; j++)
        block_acc(bigC[i][j], bigF[i][j]);
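The block_acc task invoked above is not defined in this section. A minimal sketch of what it could look like follows, assuming it mirrors block_macc: the first block is only read (input) and the second is read and updated (inout). The value of N here is an assumption made so that the fragment compiles on its own. A compiler without SMP superscalar support ignores the unrecognised pragma, so the function then simply runs sequentially.

```c
#define N 4  /* assumed block dimension; the section does not fix N */

/* Hypothetical definition of block_acc: adds every element of A onto B.
 * A is only read, so it is declared input; B is read and written, so inout. */
#pragma css task input(A) inout(B)
void block_acc(double A[N][N], double B[N][N]) {
    int i, j;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            B[i][j] += A[i][j];
}
```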
The actual task dependency graph when N = 2 is shown below. The tasks are numbered in sequential invocation order. The upper and middle rows of the graph correspond to the block_macc tasks, while the lower row corresponds to the block_acc tasks. Note that even though the task invocations of the first loop nest have dependencies among themselves (tasks 1 and 2, 3 and 4, 5 and 6, 7 and 8), the graph is capable of representing parallelism beyond the first two iterations of that loop.
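To make the dependency analysis concrete, the following sketch replays the invocation sequence for N = 2 through a toy tracker that remembers, for each parameter address, the task that last wrote it. It is an illustration only, not the SMP superscalar runtime: every name in it (tracker, task_begin, task_param, replay_example) is hypothetical, it records only producer-to-consumer dependencies, and each block is stood in for by a single byte so that it has a distinct address.

```c
#include <stdio.h>

#define NB 2          /* blocks per dimension; the text uses N = 2 */
#define MAX_ADDRS 64  /* capacity of this toy tracker (assumption) */
#define MAX_EDGES 256

enum direction { DIR_INPUT, DIR_INOUT };

struct edge { int from, to; };

struct tracker {
    const void *addr[MAX_ADDRS];   /* parameter addresses seen so far */
    int last_writer[MAX_ADDRS];    /* task that last wrote each one, 0 = none */
    int n_addrs;
    struct edge edges[MAX_EDGES];  /* recorded producer -> consumer edges */
    int n_edges;
    int next_task;                 /* tasks are numbered from 1 */
};

void tracker_init(struct tracker *t) {
    t->n_addrs = 0;
    t->n_edges = 0;
    t->next_task = 0;
}

/* Returns the slot for address p, adding it if unseen; -1 if the table is full. */
static int slot_for(struct tracker *t, const void *p) {
    int s;
    for (s = 0; s < t->n_addrs; s++)
        if (t->addr[s] == p)
            return s;
    if (t->n_addrs == MAX_ADDRS)
        return -1;
    t->addr[t->n_addrs] = p;
    t->last_writer[t->n_addrs] = 0;
    return t->n_addrs++;
}

/* Starts a new task and returns its number in invocation order. */
int task_begin(struct tracker *t) { return ++t->next_task; }

/* Registers one parameter: the task depends on the last writer of p, and
 * becomes the new last writer if the parameter is inout. */
void task_param(struct tracker *t, int task, const void *p, enum direction d) {
    int s = slot_for(t, p);
    if (s < 0)
        return;
    if (t->last_writer[s] != 0 && t->last_writer[s] != task &&
        t->n_edges < MAX_EDGES) {
        t->edges[t->n_edges].from = t->last_writer[s];
        t->edges[t->n_edges].to = task;
        t->n_edges++;
    }
    if (d == DIR_INOUT)
        t->last_writer[s] = task;
}

/* Replays the three loop nests of the example; one byte stands for one block. */
void replay_example(struct tracker *t) {
    static char bigA[NB][NB], bigB[NB][NB], bigC[NB][NB];
    static char bigD[NB][NB], bigE[NB][NB], bigF[NB][NB];
    int i, j, k, id;

    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++) {          /* block_macc on A, B, C */
                id = task_begin(t);
                task_param(t, id, &bigA[i][k], DIR_INPUT);
                task_param(t, id, &bigB[k][j], DIR_INPUT);
                task_param(t, id, &bigC[i][j], DIR_INOUT);
            }
    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++) {          /* block_macc on D, E, F */
                id = task_begin(t);
                task_param(t, id, &bigD[i][k], DIR_INPUT);
                task_param(t, id, &bigE[k][j], DIR_INPUT);
                task_param(t, id, &bigF[i][j], DIR_INOUT);
            }
    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++) {              /* block_acc on C, F */
            id = task_begin(t);
            task_param(t, id, &bigC[i][j], DIR_INPUT);
            task_param(t, id, &bigF[i][j], DIR_INOUT);
        }
}

void print_edges(const struct tracker *t) {
    int e;
    for (e = 0; e < t->n_edges; e++)
        printf("task %d -> task %d\n", t->edges[e].from, t->edges[e].to);
}
```

Calling tracker_init, replay_example and print_edges in that order lists 16 edges over 20 tasks under this model. The first four are the pairs named in the text (1 to 2, 3 to 4, 5 to 6, 7 to 8), and each block_acc task (17 to 20) depends on exactly one block_macc task from each of the two preceding loop nests.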

While this code is simple and straightforward from a sequential point of view and has a good level of parallelism under our programming model, it presents some problems for data-parallel and workqueue programming models that can only be overcome by taking the data dependencies into account and transforming the code accordingly.
Partial Synchronisation Points
While the underlying runtime is capable of handling all inter-task data dependencies, it cannot handle dependencies with non-task code. The best way to handle those cases is to create new tasks that encapsulate the relevant code, so that the runtime can take care of those dependencies. However, in some cases this is not possible or desirable, for example at the end of the program when writing the results of the whole execution to a file. Those cases can take advantage of synchronisation points.
Synchronisation points are a partial kind of barrier. In SMP superscalar, they are associated with the particular data that is going to be accessed. After a synchronisation point has been crossed, inline code is guaranteed to have all dependencies on the specified data resolved. In this regard, synchronisation points are partial, since they wait for specific values instead of waiting for all tasks to finish.
The syntax of the synchronisation point annotation is the following:
An example that writes the results of the code from the previous section follows:
int i, j;

for (i=0; i < N; i++)
    for (j=0; j < N; j++) {
        #pragma css wait on (bigF[i][j])
        block_writeOut(bigF[i][j], outputFile);
    }
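The block_writeOut function called above is ordinary non-task code and is not defined in this section. A minimal sketch of one possible version follows, assuming that outputFile is a standard FILE pointer, that a block is an N by N array of doubles, and that a plain-text layout with one row per line is acceptable. The value of N and the return convention are assumptions as well.

```c
#include <stdio.h>

#define N 4  /* assumed block dimension; the section does not fix N */

/* Hypothetical helper: writes one block as text, one row per line.
 * Returns 0 on success and -1 if a write fails. */
int block_writeOut(double block[N][N], FILE *outputFile) {
    int i, j;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            /* Separate values with spaces and end each row with a newline. */
            char sep = (j + 1 < N) ? ' ' : '\n';
            if (fprintf(outputFile, "%g%c", block[i][j], sep) < 0)
                return -1;
        }
    return 0;
}
```

Because this function is not a task, the runtime does not track its accesses to the block, which is why the synchronisation point precedes each call.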




