
POP performance prediction. Detailed analysis with Paraver


Introduction

This page describes some studies carried out on the Parallel Ocean Program (POP) application with Paraver and Dimemas. The work has been performed in cooperation with Allan Snavely (SDSC) in relation to the PERC project.

In this study we used Paraver to analyze the Dimemas simulation, comparing in detail the real run instrumented with OMPItrace against the run simulated with Dimemas. The objective was to use the analysis power of Paraver to adjust the Dimemas parameters.

The initial Paraver and Dimemas tracefiles were obtained by Nicole Wolter (SDSC) from a run on their SP machine, Blue Horizon, with a small POP case using 4 processes.


Initial simulation

The first step was to run Dimemas with the parameters initially used at SDSC: bandwidth = 350MB/s and latency = 19us. The Dimemas prediction was slightly optimistic: 218.32s real versus 204.71s predicted (an error of about 6%).
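
The error figure quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `relative_error` is ours and is not part of Paraver or Dimemas.

```python
def relative_error(real: float, predicted: float) -> float:
    """Signed relative error of a prediction with respect to the real value."""
    return (predicted - real) / real


real_s = 218.32       # elapsed time of the instrumented run (from the text)
predicted_s = 204.71  # elapsed time predicted by Dimemas (from the text)

error = relative_error(real_s, predicted_s)
# A negative value means the prediction is optimistic (shorter than reality).
print(f"prediction error: {error:.1%}")
```

The negative sign reflects that the simulation finishes earlier than the real run.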

We looked at the Paraver tracefiles to find where the differences were. A first general idea can be obtained by comparing both tracefiles visually. The captured images display one line per MPI task, where the color represents the state (dark blue = running, red = blocking, orange = global operation, ...) and the yellow lines mark the point to point communications.

  • The whole run with the same scale (the Dimemas run is a little bit shorter)




  • One iteration (at the same scale)




  • The details of few communications (showing the same interval of duration)




We see that Dimemas is very optimistic in its prediction of the communications. The difference in the global operations is not as significant as the difference in the point to point communications. The error introduced in the simulated total time is not large because the computation phase carries most of the weight of the run.

We can use Paraver to obtain some measures like:

  • The time running: on average it was 199.626s for the real run and 195.518s for the simulated run. This small difference may be caused by instrumentation overhead or simply by the fact that they are two different runs.

  • The average duration of some MPI calls: in both tracefiles, the longest MPI call is MPI_Wait. For this call the duration in the first MPI task is significantly smaller than in the other tasks, which indicates a slight load imbalance in the code preceding the call.

MPI_Wait      Real run     Simulated run
Task 1        3842us       366us
Task 2        64854us      9900us
Task 3        68835us      10516us
Task 4        73256us      11127us
Avg value     52492us      7977us


The predictions for all the MPI calls show a significant percentage of error. The following table gives some examples:

MPI call         Real run      Simulated run
MPI_WaitAll      506.91us      353.35us
MPI_Bcast        694.66us      381.62us
MPI_Allreduce    386.13us      205.73us
MPI_Allgather    1144.86us     829.66us


The fact that the real time of every primitive is greater than its simulated time suggests that the effective bandwidth achieved is smaller than 350MB/s, and that the implementation of some collectives may not fit a logarithmic model.


Adjusting Dimemas parameters


Based on the results of the first analysis, we decided to run a simulation with the bandwidth reduced to 1/10th of the initial value (35MB/s). In this case the simulated time was closer to the real run: 207.66s.
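
To get a feeling for what a tenfold bandwidth reduction does to a single message, here is a minimal sketch that assumes the common linear point to point cost model (time = latency + size / bandwidth). The message sizes are hypothetical and are not taken from the POP traces, and the function `transfer_time_us` is ours; the real simulator takes further parameters into account that this sketch ignores.

```python
def transfer_time_us(size_bytes: int, latency_us: float, bandwidth_mb_s: float) -> float:
    """Linear cost model: latency plus message size divided by bandwidth.

    1 MB/s is taken as 1e6 bytes/s, which is exactly 1 byte per microsecond,
    so size_bytes / bandwidth_mb_s is already expressed in microseconds.
    """
    return latency_us + size_bytes / bandwidth_mb_s


LATENCY_US = 19.0  # latency used in the study (from the text)

# Hypothetical message sizes, chosen only to show the trend.
for size in (1_000, 100_000, 1_000_000):
    fast = transfer_time_us(size, LATENCY_US, 350.0)
    slow = transfer_time_us(size, LATENCY_US, 35.0)
    print(f"{size:>9} bytes: {fast:10.2f}us at 350MB/s, "
          f"{slow:10.2f}us at 35MB/s (x{slow / fast:.2f})")
```

Under this model small messages are dominated by the latency, so they change little, while the cost of large messages approaches ten times the original value.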

If we now look at the graphical view, we see that the simulation of the point to point communications is closer to reality, but the collectives are still quite optimistic.





We can verify what happened with the MPI calls average duration in the new run:

MPI call         Real run       Simulated run      Simulated run
                                (Bw = 350MB/s)     (Bw = 35MB/s)
MPI_Wait         52492.17us     7977.78us          13482.85us
MPI_WaitAll      506.91us       353.35us           689.98us
MPI_Bcast        694.66us       381.62us           383.41us
MPI_Allreduce    386.13us       205.73us           168.73us
MPI_Allgather    1144.86us      829.66us           826.57us


Keep in mind that Dimemas does not try to fit the simulation of each MPI call independently; it models the communications, which in some cases involve more than one MPI primitive.

We can see that the simulation of MPI_Wait and MPI_WaitAll has improved, but the collective operations are still faster in the simulated run. This suggested running a new simulation with the same bandwidth but a more detailed model for the collectives.

We took the collectives configuration file that we use for the SGI Origin2000 platforms, but modeled the broadcast, allreduce and allgather primitives with a linear model. With this new run the elapsed time is 211.98s. Since there is a difference of slightly more than 4 seconds in the running time between the Dimemas and the Paraver tracefiles, the difference due to the communication modeling is around 2 seconds.
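
The difference between a logarithmic and a linear model for a collective can be sketched as follows. This is only an illustration under our own assumptions: a collective is taken to cost a number of communication steps, each step costing latency + size / bandwidth, with ceil(log2(P)) steps in the logarithmic case and P - 1 steps in the linear case. It is not the format of the collectives configuration file, and it does not try to reproduce the simulated values reported on this page.

```python
def collective_steps(num_tasks: int, model: str) -> int:
    """Number of communication steps assumed for a collective on num_tasks tasks."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    if model == "log":
        # Integer ceil(log2(num_tasks)), e.g. 4 tasks -> 2 steps.
        return (num_tasks - 1).bit_length()
    if model == "linear":
        return num_tasks - 1
    raise ValueError(f"unknown model: {model!r}")


def collective_time_us(num_tasks: int, size_bytes: int, latency_us: float,
                       bandwidth_mb_s: float, model: str) -> float:
    """Assumed cost: steps * (latency + size / bandwidth); 1 MB/s = 1 byte/us."""
    step_us = latency_us + size_bytes / bandwidth_mb_s
    return collective_steps(num_tasks, model) * step_us


# The message size (8 bytes) is hypothetical; latency and bandwidth are from the text.
for tasks in (4, 16, 64):
    log_t = collective_time_us(tasks, 8, 19.0, 35.0, "log")
    lin_t = collective_time_us(tasks, 8, 19.0, 35.0, "linear")
    print(f"{tasks:3d} tasks: log = {log_t:8.2f}us, linear = {lin_t:8.2f}us")
```

With only 4 tasks the two step counts are close (2 versus 3), but the gap grows quickly with the number of tasks, so the choice of model matters more for larger runs.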

Comparing the instrumented run with the last simulation we can see:

  • At the level of few communications (same scale)




  • At the level of one iteration (same scale)




  • And comparing the MPI calls average duration

MPI call         Real run       Simulated run      Simulated run      Simulated run
                                (Bw = 350MB/s)     (Bw = 35MB/s)      (Bw = 35MB/s,
                                                                      collectives model)
MPI_Wait         52492.17us     7977.78us          13482.85us         13482.85us
MPI_WaitAll      506.91us       353.35us           689.98us           689.98us
MPI_Bcast        694.66us       381.62us           383.41us           3405.18us
MPI_Allreduce    386.13us       205.73us           168.73us           346.81us
MPI_Allgather    1144.86us      829.66us           826.57us           1010.51us
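
The figures in the table above are easier to compare as signed relative errors with respect to the real run. The following sketch only rearranges the numbers already reported; the variable and function names are ours.

```python
# Average MPI call durations in microseconds, copied from the table above.
# Columns: real run, Bw = 350MB/s, Bw = 35MB/s, Bw = 35MB/s with collectives model.
durations_us = {
    "MPI_Wait":      (52492.17, 7977.78, 13482.85, 13482.85),
    "MPI_WaitAll":   (506.91, 353.35, 689.98, 689.98),
    "MPI_Bcast":     (694.66, 381.62, 383.41, 3405.18),
    "MPI_Allreduce": (386.13, 205.73, 168.73, 346.81),
    "MPI_Allgather": (1144.86, 829.66, 826.57, 1010.51),
}


def signed_errors(row):
    """Relative error of each simulated value with respect to the real one."""
    real, *simulated = row
    return [(value - real) / real for value in simulated]


print(f"{'MPI call':<14} {'350MB/s':>9} {'35MB/s':>9} {'35MB/s+coll':>12}")
for call, row in durations_us.items():
    e350, e35, e35_coll = signed_errors(row)
    print(f"{call:<14} {e350:>+9.1%} {e35:>+9.1%} {e35_coll:>+12.1%}")
```

Negative values are optimistic predictions. With these figures the collectives model brings MPI_Allreduce and MPI_Allgather to within roughly 10 to 12% of the real run, while MPI_Bcast changes from being underestimated to being overestimated.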



We can consider this last simulation good enough, but we could also continue the validation and adjust other parameters, for instance checking whether a latency of 19us is a good approximation or can be improved.


Ideal environment simulation

We ran a simulation with zero latency and infinite bandwidth. This is an unrealistic simulation that gives us an idea of the limits of the application. In this case the total time is 202.21s.

The equivalent speedup (total computing time divided by total time) is 3.87. The fact that we cannot reach a speedup of 4 with unlimited resources indicates a slight computation imbalance between the MPI tasks. As the difference is small (0.13, roughly 3% of the ideal value), we can say that the imbalances of the application are not significant.
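
The speedup figure can be reproduced approximately from numbers given earlier on this page. The sketch below assumes that the total computing time is the number of tasks multiplied by the average running time of the simulated trace (195.518s); that assumption is ours and may not be exactly how the original value was computed.

```python
num_tasks = 4
avg_running_s = 195.518  # average running time per task, simulated trace (from the text)
ideal_total_s = 202.21   # elapsed time with zero latency and infinite bandwidth (from the text)

# Assumption: total computing time = number of tasks * average running time.
total_computing_s = num_tasks * avg_running_s
speedup = total_computing_s / ideal_total_s
efficiency = speedup / num_tasks

print(f"speedup    = {speedup:.2f}")
print(f"efficiency = {efficiency:.1%}")
print(f"gap to ideal speedup = {num_tasks - speedup:.2f}")
```

Under that assumption the result rounds to the 3.87 quoted above, that is, a parallel efficiency slightly below 97%.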

Looking for application imbalances

Nevertheless, we can again use Paraver to look for details about these imbalances. At the iteration level, we see that the 3rd MPI task tends to finish its computation work (dark blue) slightly before the other tasks. We could obtain a tracefile with hardware counters to check whether the imbalance arises because this task executes fewer instructions than the others or because, despite executing the same number of instructions, it achieves a higher IPC.



Using the 2d-analyzer module of Paraver we can obtain a histogram of the duration of the CPU bursts.



The columns represent intervals of CPU burst duration (increasing from left to right), and the color of each cell represents the number of times a CPU burst lasted within the given duration interval. With the limits set when this window was captured, yellow represents durations that happened only once, orange represents durations that happened more than 7 times, and values between 2 and 7 are shown as a gradient from light green to dark blue.
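
The idea behind such a histogram can be sketched in a few lines. This is not how the 2d-analyzer module is implemented; it is a minimal illustration that assumes one row per task and fixed-width duration bins, and the burst durations are hypothetical.

```python
from collections import Counter


def burst_histogram(bursts, bin_width_us, num_bins):
    """Count CPU bursts per (task, duration bin).

    bursts is an iterable of (task_id, duration_us) pairs. Durations beyond
    the last bin are clamped into it.
    """
    counts = Counter()
    for task, duration_us in bursts:
        bin_index = min(int(duration_us // bin_width_us), num_bins - 1)
        counts[(task, bin_index)] += 1
    return counts


# Hypothetical bursts as (task, duration in us); not taken from the POP trace.
bursts = [
    (1, 120.0), (1, 130.0), (1, 950.0),
    (2, 125.0), (2, 980.0),
    (3, 110.0), (3, 890.0),
    (4, 128.0), (4, 990.0),
]

NUM_BINS = 10
hist = burst_histogram(bursts, bin_width_us=100.0, num_bins=NUM_BINS)

# One row per task, one column per duration interval (short to long).
for task in sorted({t for t, _ in bursts}):
    row = [hist[(task, b)] for b in range(NUM_BINS)]
    print(f"task {task}:", " ".join(f"{count:2d}" for count in row))
```

In this made-up data, task 3 has its long burst one column to the left of the other tasks, which is the kind of pattern that reveals a task finishing its computation earlier.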

We can see on the right side the imbalance that we detected visually. There is another imbalance close to the middle of the image; in this case task 3 is the one with the longest bursts. If we compute the balance of the whole application, these differences compensate each other, and the resulting values suggest that the application was more balanced than it really was.
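
The compensation effect can be shown with a small numerical example. The per-task times below are hypothetical, the balance metric (average time divided by maximum time) is one common choice rather than necessarily the one used by Paraver, and the elapsed time estimate assumes that the tasks synchronize between the two phases.

```python
def balance(times):
    """Load balance as average time divided by maximum time (1.0 = perfect)."""
    return sum(times) / len(times) / max(times)


# Hypothetical computation times in seconds for tasks 1 to 4.
phase_a = [10.0, 10.0, 8.0, 10.0]  # task 3 finishes early
phase_b = [5.0, 5.0, 7.0, 5.0]     # task 3 has the longest bursts

whole_run = [a + b for a, b in zip(phase_a, phase_b)]

print(f"balance of phase A:   {balance(phase_a):.3f}")
print(f"balance of phase B:   {balance(phase_b):.3f}")
print(f"balance of whole run: {balance(whole_run):.3f}")

# Assuming a synchronization between phases, each phase lasts as long as
# its slowest task, so the elapsed time exceeds the per-task totals.
elapsed = max(phase_a) + max(phase_b)
print(f"elapsed time with synchronization: {elapsed:.1f}s "
      f"versus {max(whole_run):.1f}s of work per task")
```

The whole run appears perfectly balanced even though each phase is not, which is why the analysis has to be done per phase or per burst interval.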

Conclusions

This example shows how interoperation between Paraver and Dimemas is an extremely powerful tool to understand the behavior of MPI applications both at the macroscopic and microscopic level. Comparing an actual run with a modeled run enables the analyst to better understand the behavior of applications and the systems where they run.

Through the example we describe an iterative approach to fit the model parameters so that they match the observed performance of real applications. This is a very powerful mechanism to understand the actual behavior of a system, as opposed to relying blindly on nominal values provided by the manufacturer or obtained through simple benchmarks. The parallel machine model of Dimemas is an abstraction of the communication process rather than the exact representation of a physical architecture. The model captures, in a global way, not only the hardware but also run time and system issues. It is also important for an analyst to be aware that simple microbenchmarks may not be representative of real application loads and may not experience the same levels of contention (software and hardware). These considerations are reflected in this study as a significant reduction in the effective bandwidth observed in the communication phases with respect to what a ping-pong benchmark would report.

The example shows what we think is the most adequate way to use Paraver and Dimemas jointly. The iterative process interleaves the use of both tools, and also combines visual perception (through the Paraver visualization module) with quantitative measurements (global numbers reported by Dimemas or more detailed statistics computed by the Paraver analysis module).
 
  Barcelona Supercomputing Center, 2010 - Legal Notice
 