Analysis of scalability problems
Introduction
The objective of this study was to find out why a certain F90
application did not scale as well as expected. The user suspected that
the cause was the calls to intrinsic routines for matrix multiplication
and wanted to use Paraver to prove this.
In this study we compared some basic metrics of the 1-process and
4-process tracefiles to check whether they show differences that
explain the scalability problems.
Visual analysis
The first recommended step is to look at the traces visually. In our
case we obtained the following views for the 4-thread trace:
We can see some imbalance between processes in the 3 main loops. This
imbalance is one of the reasons for the scalability problems, but we
can continue the analysis by looking at some metrics that
characterize the program and the performance of the run.
Comparing the program characteristics
The first objective is to check whether the observed imbalance can be
confirmed at the level of instructions. The next figure shows
a significant imbalance of work between threads.
This imbalance in the number of instructions is one of the obstacles to
achieving good scalability.
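
The imbalance seen in the instruction counts can be summarized in a single number. The following is a minimal sketch, assuming the per-thread instruction counts have been read off the Paraver profile by hand; the function name `load_balance` and the example counts are hypothetical and are not values from the trace. It uses the common mean/max definition of load balance, where 1.0 means perfectly balanced work.

```python
from statistics import mean


def load_balance(instructions_per_thread):
    """Return mean/max of the per-thread instruction counts.

    1.0 means every thread executed the same number of instructions;
    lower values mean the most loaded thread makes the others wait.
    """
    if not instructions_per_thread:
        raise ValueError("need at least one thread")
    return mean(instructions_per_thread) / max(instructions_per_thread)


# Hypothetical instruction counts for the 4 threads (NOT taken from the trace).
example_counts = [60_000_000, 48_000_000, 45_000_000, 36_000_000]

balance = load_balance(example_counts)
print(f"load balance: {balance:.4f}")
# If imbalance were the only problem, speedup could not exceed threads * balance.
print(f"speedup ceiling from imbalance alone: {balance * len(example_counts):.2f}")
```

With real counts from the profile, a value well below 1.0 would confirm at the instruction level what the timeline view suggests.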

Next we compare the number of instructions in the two tracefiles.
The total number of instructions in the 3 main loops increases
substantially (from approximately 115,000,000 to 189,000,000), while the
other loops keep the same number of instructions. This increase in
instructions is another reason for the poor scalability.
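
To put the instruction increase in perspective, the arithmetic can be sketched as follows. This is only an illustration under stated assumptions: the two totals are the approximate figures quoted above for the 3 main loops, and the helper name `instruction_scaling` is hypothetical. The ratio of reference instructions to observed instructions gives an upper bound on the parallel efficiency of those loops if everything else (balance, instruction rate) were ideal.

```python
def instruction_scaling(ref_instructions, instructions):
    """Ratio of useful (reference) instructions to instructions actually executed."""
    if ref_instructions <= 0 or instructions <= 0:
        raise ValueError("instruction counts must be positive")
    return ref_instructions / instructions


# Approximate totals for the 3 main loops, as quoted in the text.
INSTR_1_PROCESS = 115_000_000
INSTR_4_PROCESSES = 189_000_000
N_PROCESSES = 4

scaling = instruction_scaling(INSTR_1_PROCESS, INSTR_4_PROCESSES)
growth = INSTR_4_PROCESSES / INSTR_1_PROCESS

print(f"instruction growth: {growth:.2f}x")
print(f"instruction scaling: {scaling:.2f}")
# Assumption: perfect balance and unchanged instruction rate.
speedup_ceiling = N_PROCESSES * scaling
print(f"speedup ceiling from extra instructions alone: {speedup_ceiling:.2f}")
```

In other words, the 4-process run executes roughly 64% more instructions in these loops, so even a perfectly balanced run could not reach a speedup of 4 there.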
It would be useful to instrument inside the big loops to identify
where the additional instructions come from. To validate the user's
theory about the intrinsic calls, we can check whether these loops are
the ones that call the matrix multiplication routines.
Comparing the performance behaviour
We can also compare the performance of both runs. The suggested metrics to check are:
- The number of instructions per µs. As the next two figures show, there is a significant reduction (from approximately
400 to 270). This is the third reason we found for the poor scalability results.


- The number of L2 misses. In this case there is a very large increase, which may explain the reduction in the MIPS
rate.
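
The drop in instruction rate can be quantified in the same spirit, and combined with the instruction increase. The sketch below assumes that the quoted rates (about 400 and 270 instructions per µs) and the quoted instruction totals (about 115,000,000 and 189,000,000) describe the same code regions, which the text does not state explicitly. The multiplicative combination is a common simplification, not something measured by Paraver itself, and all variable names are hypothetical.

```python
# Approximate instruction rates quoted in the text (instructions per us).
RATE_1_PROCESS = 400.0
RATE_4_PROCESSES = 270.0

# Approximate instruction totals for the 3 main loops quoted in the text.
INSTR_1_PROCESS = 115_000_000
INSTR_4_PROCESSES = 189_000_000

# How much slower each instruction stream runs in the 4-process case.
rate_scaling = RATE_4_PROCESSES / RATE_1_PROCESS

# How much of the executed work is "useful" relative to the 1-process run.
instr_scaling = INSTR_1_PROCESS / INSTR_4_PROCESSES

# Assumption: the two effects are independent and multiply.
computation_scaling = rate_scaling * instr_scaling

print(f"instruction rate scaling: {rate_scaling:.3f}")
print(f"instruction scaling:      {instr_scaling:.3f}")
print(f"combined scaling:         {computation_scaling:.3f}")
```

Under these assumptions the combined factor is already well below 1 before load imbalance is taken into account, which is consistent with several causes contributing at once.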

Conclusions
On this page we show the kind of simple study that can be performed
to compare two tracefiles. In the example described, the objective is to
study the scalability problems of a program run with different numbers of
processors.
The comparative analysis should be based on
applying the same configuration files to the different traces. The study starts with simple views
of parallel process activity that can reveal potential load imbalance. It is very important not to
stop there, even if some cause of poor scalability is easily identified in the graphical views. The
analyst should evaluate further potential causes. 2D profiles of metrics derived from hardware
counter information can easily be computed for each parallel function and may point out additional
performance problems.
In this example, we have identified that a combination of several
factors leads to the scalability problem of the application. In
our case those factors are real load imbalance in terms of
instructions, an increase in the number of instructions executed by the
algorithm, and an increase in the number of L2 misses.