
Performance Analysis of the CAM code with Paraver

Introduction

This page describes some studies carried out on the Community Atmosphere Model (CAM) v2 application with OMPItrace and Paraver. The work was performed in cooperation with Pat Worley (ORNL) as part of the PERC project. We describe the major findings obtained when using these tools to understand the application's behavior.

We ran many different configurations of the application: pure MPI and mixed MPI+OpenMP, varying the total number of processes, on Power3 and Power4 platforms. The study presented on this page used a tracefile for 16 MPI processes, each of them single-threaded. The trace was obtained with OMPItrace on the Regatta system at TACC.
The objective was to validate whether Paraver allows someone with no previous knowledge of a real application to analyze it in a short timeframe.

The first issue faced when tracing a large and unknown code is identifying the structure of the application. This may require inserting into the source code, at the proper place, a call to the tracing API that shuts down the tracing process after a few iterations. The source code was analyzed and the iterative loop identified.

Tracing included reads of the hardware counters at the entry and exit of each parallel function, of their corresponding user functions, and at the MPI calls. Before tracing, the source code was also modified so that, every certain number of iterations, it calls the API that switches the group of hardware counters being monitored.
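
The instrumentation schedule described above can be sketched as follows. This is only an illustration of the logic, not the OMPItrace API: the function name `counter_schedule`, the number of counter groups and the switch period are hypothetical, since the page does not state the values used.

```python
def counter_schedule(total_iterations, traced_iterations, num_groups, switch_every):
    """Return, for each iteration, the active counter group (0-based),
    or None once tracing has been shut down.

    All parameters are illustrative assumptions, not values from the study.
    """
    schedule = []
    for it in range(total_iterations):
        if it >= traced_iterations:
            # Tracing was shut down after the first few iterations.
            schedule.append(None)
        else:
            # Rotate to the next counter group every `switch_every` iterations.
            schedule.append((it // switch_every) % num_groups)
    return schedule


if __name__ == "__main__":
    # 15 traced iterations as in the traces described on this page;
    # 3 groups switched every iteration are made-up values.
    print(counter_schedule(20, 15, 3, 1))
```

With such a rotation, a metric that needs a given counter group is only available in the iterations where that group was active.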

We obtained one trace for each of the above mentioned configurations. Each trace covers 15 iterations.


Qualitative analysis: timeline views

The following views show the user functions instrumented. Each color corresponds to one function. All MPI activity has been blackened out to focus the analysis on the user level code.

The views show different levels of load imbalance. Routine physpkg (purple) shows a significant imbalance in one iteration out of every three, in a repetitive pattern. Less important in absolute terms, but showing an interesting behavior, routine scanslt (light orange) takes longer in the first two and last two processes. The situation is reversed for routine driver (yellow). Discussing the results with a person who understands the algorithms and physics solved by the code made it possible to associate the periodic behavior of routine physpkg with the simulation of day or night periods. The behavior of routine scanslt relates to the different characteristics of the model in the areas close to the equator and the poles.

A large number of performance metrics can be derived from the hardware counter information obtained. The following views display some of these metrics. The displays are shown as gradients from light green for low values to dark blue for high values (each metric has its own scale for low and high values). The views are synchronized with the view showing the user functions. The views appear totally black in those time regions where the required hardware counters were not read.

The metrics shown are:

  • MFLOPS
  • MIPS
  • Ratio of FMAs to FPUs: Indicating characteristics of the application and the ability of the compiler to generate FMA instructions.
  • Total Bandwidth: Displaying in MB/s the total bandwidth used by the processor (coming from all levels of the memory hierarchy)
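
As a rough illustration of how the metrics listed above can be derived from counter readings over one interval, consider the following sketch. The field names, the 128-byte line size and the reading of "FPUs" as instructions executed by the floating point units are assumptions made for the example. They do not describe the actual counters or formulas used by Paraver.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Counter deltas over one traced interval (hypothetical field names)."""
    elapsed_us: float   # interval duration in microseconds
    instructions: int   # instructions completed
    fp_ops: int         # floating point operations
    fma_instr: int      # fused multiply-add instructions
    fpu_instr: int      # instructions executed by the floating point units
    lines_loaded: int   # cache lines loaded from any level of the hierarchy


def derived_metrics(s, line_bytes=128):
    """Compute MIPS, MFLOPS, FMA ratio and total bandwidth in MB/s.

    `line_bytes` is an assumed cache line size; adjust it for the platform.
    """
    # Events per microsecond equal millions of events per second.
    mips = s.instructions / s.elapsed_us
    mflops = s.fp_ops / s.elapsed_us
    fma_ratio = s.fma_instr / s.fpu_instr if s.fpu_instr else 0.0
    # Bytes per microsecond equal MB/s (with 1 MB = 10^6 bytes).
    bandwidth_mb_s = s.lines_loaded * line_bytes / s.elapsed_us
    return {
        "MIPS": mips,
        "MFLOPS": mflops,
        "FMA/FPU": fma_ratio,
        "MB/s": bandwidth_mb_s,
    }
```

Because each metric needs specific counters, a metric can only be evaluated in intervals where the corresponding counter group was being read.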











Quantitative analyses

Beyond the qualitative information provided by the timeline views, it is very important to be able to summarize and present the performance data quantitatively. In Paraver this is done with the 2D analysis module. In its basic form, it shows a table with one row per thread/process and one column per function. Each entry shows a statistic computed over the intervals where the thread was inside the function. The columns can be ordered by decreasing value of a given statistic.
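
The idea behind such a table can be sketched in a few lines. This is a simplified illustration, not Paraver's implementation: the input format (a list of `(thread, function, start, end)` tuples) and the function names are assumptions, and the statistic used here is simply the accumulated time inside each function.

```python
from collections import defaultdict


def profile_table(intervals):
    """Build {thread: {function: total_time}} from (thread, function, start, end) tuples."""
    table = defaultdict(lambda: defaultdict(float))
    for thread, func, start, end in intervals:
        # Accumulate the statistic over every interval spent inside the function.
        table[thread][func] += end - start
    return {thread: dict(row) for thread, row in table.items()}


def columns_by_decreasing(table):
    """Order the function columns by decreasing total of the statistic."""
    totals = defaultdict(float)
    for row in table.values():
        for func, value in row.items():
            totals[func] += value
    return sorted(totals, key=lambda f: totals[f], reverse=True)


if __name__ == "__main__":
    # Made-up intervals for two threads; times are in arbitrary units.
    demo = [
        (0, "physpkg", 0, 4), (0, "scanslt", 4, 5),
        (1, "physpkg", 0, 2), (1, "scanslt", 2, 5),
    ]
    table = profile_table(demo)
    for thread in sorted(table):
        print(thread, [table[thread].get(f, 0.0) for f in columns_by_decreasing(table)])
```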

In the figures below we present a set of profile statistics computed with this module. Columns appear sorted by decreasing percentage of execution time. All of them correspond to the section of the trace shown in the timelines above. Each entry is colored using a gradient from light green for the smallest value in the table to dark blue for the largest.

First we show, for each thread, the percentage of its total computing time (MPI excluded) taken by each routine. The imbalance in the first three routines can be observed.
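
One way to put numbers on this profile is sketched below. The percentage follows the definition given above. The imbalance factor (maximum over mean across threads) is a common convention chosen for this example, not necessarily a statistic reported by Paraver, and the function names are hypothetical.

```python
def percent_of_compute(row):
    """Percentage of one thread's computing time spent in each routine.

    `row` maps routine name to time; MPI time is assumed to be excluded already.
    """
    total = sum(row.values())
    return {routine: 100.0 * t / total for routine, t in row.items()}


def imbalance_factor(times):
    """Maximum over mean of one routine's time across threads.

    1.0 means perfectly balanced; larger values mean more imbalance.
    """
    mean = sum(times) / len(times)
    return max(times) / mean


if __name__ == "__main__":
    # Made-up per-thread times for a single routine, in arbitrary units.
    print(imbalance_factor([1.0, 1.0, 2.0, 4.0]))
```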


The next two tables show some characteristics of the program: the percentage of floating point instructions over the total number of instructions inside each routine (first table) and the ratio of FMAs to FPUs that the compiler was able to generate (second table). It is apparent that dyndrv has few floating point instructions, while the other routines are floating point intensive, as one would expect.



In the next figure we can observe how the percentage of FMAs executed varies significantly across routines. It is interesting to observe the large variation within scanslt between the threads at both extremes and the central threads. The difference in physics and model between the poles and the equator is reflected not only in longer execution time, as we saw before, but also in different characteristics of the computation.



The next group of tables looks at the memory bandwidth and at which level of the memory hierarchy is actually serving the data. The table of total memory bandwidth shows imbalances between threads within one function. Interestingly, it shows some imbalance in routine dyndrv that was not apparent from the previous statistics.



The next table shows how much of that bandwidth comes from memory. It is a very small part of the total.



The next table shows how much comes from L3. Not much either, but in both tables there is a significant difference between routines.



The last table related to the memory hierarchy shows how much data comes from L2.
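
The breakdown by memory hierarchy level discussed in these tables can be sketched as follows. The sketch assumes that, for each interval, the number of cache lines loaded from L2, L3 and memory is known. The function name, the dictionary keys, the 128-byte line size and the sample counts are illustrative assumptions, not measurements from this study.

```python
def bandwidth_by_level(lines_from, elapsed_us, line_bytes=128):
    """Return {level: (bandwidth in MB/s, share of total in %)}.

    lines_from: cache lines loaded from each level, e.g. {"L2": n, "L3": n, "memory": n}.
    line_bytes: assumed cache line size, to be adjusted for the platform.
    """
    total_lines = sum(lines_from.values())
    result = {}
    for level, lines in lines_from.items():
        # Bytes per microsecond equal MB/s (with 1 MB = 10^6 bytes).
        mb_s = lines * line_bytes / elapsed_us
        share = 100.0 * lines / total_lines if total_lines else 0.0
        result[level] = (mb_s, share)
    return result


if __name__ == "__main__":
    # Made-up counts in which most of the traffic is served by L2.
    for level, (mb_s, share) in bandwidth_by_level(
        {"L2": 900, "L3": 80, "memory": 20}, elapsed_us=100.0
    ).items():
        print(f"{level}: {mb_s:.1f} MB/s ({share:.1f}%)")
```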



The last two tables show the real performance metrics in which a user would be interested: MIPS and MFLOPS.






Conclusions

We were able to analyze the CAM code in a short time (one week, with one person dedicating 30% of their working time; more than half of that time went into obtaining the tracefiles).

Before this study we had no knowledge of the application. After obtaining and analyzing some tracefiles, we were able to identify some patterns and imbalances in the application, which were confirmed and explained by Pat Worley:

  • Routine physpkg shows a significant imbalance in one iteration out of every three, in a repetitive pattern. The periodic behavior of routine physpkg is associated with the simulation of day or night periods.
  • Less important in absolute terms, but showing an interesting behavior, routine scanslt takes longer in the first two and last two processes. The situation is reversed for routine driver. The behavior of routine scanslt relates to the different characteristics of the model in the areas close to the equator and the poles. A large variation in the percentage of FMAs executed within scanslt was observed between the threads at both extremes and the central threads. The difference in physics and model between the poles and the equator is reflected not only in longer execution time, but also in different characteristics of the computation.
  • dyndrv has few floating point instructions, but other routines are floating point intensive as one would expect.
  • With respect to the memory bandwidth and which level of the memory hierarchy is actually serving the data, the table of total memory bandwidth shows imbalances between threads within one function. Interestingly, it shows some imbalance in routine dyndrv that was not apparent from the previous statistics. The other tables show that most of the traffic comes from the L2 level, with a very small percentage from L3 and memory.

 
Barcelona Supercomputing Center, 2010
 