Comparing MPI vs MLP
Introduction
This example describes the analyses carried out with Paraver to quantify the performance difference between two versions of multilevel application that were supposedly identical although they used two different communication mechanisms.
Both used OpenMP at the innermost level. One version used MPI at the outermost level while the other used MLP (programming model developed at NASA AMES where shared memory regions are allocated to perform the communication between processes). The configuration analyzed used 64 processors (8 processes 8 threads each) in both cases. Three important features of Paraver are exposed through this example:
- The potential of the semantic module to focus on very specific parts of a program behavior. In our case some specific workshares.
- the usefulness of histograms computed through the 2D analysis module. In our case to identify the different behavior across threads.
- The capability to use performance models in an analysis. In this example used to generate a single view that can expose performance problems derived from memory placement issues.
The analysis shows how the different behavior of two programming models/communication mechanisms is not necessarily in the easily arguable different communication speeds. Many system issues do affect the actual application performance. In this case, the mapping of processes and memory pages to nodes is the cause of the difference.
Basic profiles
A typically used first view is user_functionsxth.cfg representing by a distinct color the time span of each of the user functions called by each thread. Views at the same time scale from both versions are shown in Figure 1 and Figure 2. There is a clear difference (~20% or even larger) in the elapsed time of user routines in both version as can be quantified by the 2D analysis module of Paraver and is reported in Table 1.
Figure 1: User functions called by each thread. MPI version

Figure 2: User functions called by each thread. MLP version

Table 1: average duration (ms) of user functions
Workshare duration
In order to investigate the behavior at a deeper level, the In_worksharing.cfg configuration file was used to expose when each thread is inside a workshare. The In_worksharing.cfg configuration file relies on the events generated by OMPItrace when entering and exiting a workshare. An interesting feature is that in many applications implementing pipelines across threads of an OpenMP process, the busy waits are performed by each thread before entering their corresponding section of the workshare.
Futhermore, we were only interested in when the above specified user functions are executing within a workshare. By multiplying the user_functionsxth.cfg view by the In_worksharing.cfg, we can get such view as displayed in Figure 3 and Figure 4. As can be seen, the set up of the pipelines has been wiped out (now appears in black even if a thread is inside a given user routine) as well as certain intervals through the computation.

Figure 3: User functions iff in worksharing. MPI version

Figure 4: User function iff in worksharing. MLP version
Using the composition and derivation functionalities of the semantic module of Paraver it is possible to construct views where only the workshares inside a specific user function are considered. It is also possible to derive timelines where the displayed magnitude is the duration of each specific workshare or cero in all regions outside the workshare. As an example, Figure 5 is a timeline of the duration of workshares within user function uf3. The Y scale has been set to min=30 ms (light green) and max=60 ms (dark blue). By focusing the scale in such a way, the difference between processors is apparent.

Figure 5 : Duration of workshares within uf3. MPI code.
Histograms of specific workshare duration
Applying the 2D analysis module to a view as the one in Figure 5 we can get a detailed statistical characterization of the behavior of a specific workshare. Figures Figure 6 and Figure 7 show the histograms for one of the workshares in both versions. In both of them, each row represents one thread. The horizontal axis represents duration of the workshare. Each column represents one bin in the histogram and corresponds to 1500 us. The metric displayed is total time spent by the thread inside a workshare whose individual duration falls in the corresponding bin. The histogram for the MLP version shows a bimodal distribution, where most of the workshares are in the order of 37ms while a few others are in the order of 46ms. The histogram for the MPI version shows a consistent and significant difference between certain groups of 4 processors. The two modes of the distribution have been shifted up for some groups, reaching values around 54ms and 77ms.
This imbalance in the duration of the parallel regions propagates through the pipeline and results in the holes observed in Figure 3 and the large difference in the duration of the user functions as reported in Table 1.

Figure 6 : Histogram of workshares duration within uf3 (MLP code)

Figure 7 : Histogram of workshares duration within uf3 (MPI code)
Load imbalance: time or instructions?
The following question that arises naturally is what causes the observed load imbalance in the MPI version. As far as the program developer knew, there should have been no such imbalance. In order to confirm the user assessment a trace with hardware counters (graduated instructions and L2 misses) was obtained for the MPI version of the code. The result of the profile made using the 2D module is shown in Figure 8.

Figure 8: Useful instructions in parallel dos and workshares
The instruction counts within each process do show a very small load imbalance. Threads 1 to 6 of each process have a few more instructions than threads 7 and 8. The difference is pretty small (<3%) but detectable and consistent across different functions. It can be interpreted as deriving from the iteration space not being multiple of the number of threads. Nevertheless, this imbalance does not match the observed imbalance in terms of duration of the workshares.
A similar analysis was done for the MLP code. As the user had expressed, we observed that the difference in instruction counts for both versions of the program were really small (<0.00002%).
Load imbalance: cache misses?
Having discarded the variation of computational cost as cause of the load imbalance, we then suspect of L2 cache misses. A similar 2D profile as above, but this time on L2 misses is shown in Figure 9 for the MPI code and Figure 10 for the MLP version. The gradient color encoding seems to indicate some difference across threads (due to the narrow range between Min and Max towards the bottom right corner of the 2D) but the numerical differences (and computed Stdev) are really small. It is even the case that the MPI version has fewer totals L2 misses for some functions. The small variation across threads is reflected in the timelines view of Figure 11.

Figure 9: Cache misses within parallel dos and workshares for the MPI version

Figure 10: L2 misses within parallel dos and workshares for the MLP version

Figure 11: L2 misses in the MPI version
When jointly looking at the time taken by a given function, its number of instructions and misses across threads we conclude that the same number of L2 misses take a fairly different time on different threads. For some reason, the thread to processor mapping and page to node mapping do result in fairly different access times for each of the processor groups. It is important to mention how the analysis of a single trace with hardware counters can give indications of memory/processor placement problems.
Memory cost model
A problem with the approach followed to detect the possible memory placement problem is that as analyst, we had to mentally correlate the views of time, instructions and misses. The question that arises is whether a single view could be built that helps identify the problem. This is the point where simple performance models can help even if they are not very accurate. We built a view estimating the cost of the L2 cache misses according to the following model that is applied to each interval between hardware counter samples:

The parameter in this expression that is not available in the trace is IdealMIPS, the MIPS that would be achieved by the code under ideal memory subsystem (zero L2 miss cost). Our current approach is to use an estimate of 500MIPS. In the current paraver configuration file, we also consider that ALL routines have the same ideal MIPS. I a very interesting future work we will obtain this number from the compiler estimates. Computing a profile of average L2missCost per user routine we get the result of Figure 12 for the MPI case and Figure 13 for the MLP case.
Figure 12 clearly identifies the different cost of L2 misses for the two groups of four threads within each process. Although a difference between threads is consistently detected across routines, the actual ratio is dependent on the routine. This may be caused by the assumption of the model that the ideal MIPS is the same for all of them being wrong. It may also be caused by an actual difference in the memory access pattern across routines with different local and remote access ratios.

Figure 12: Estimated L2 miss cost for the MPI code
Figure 13 also shows some difference between groups of four threads, although the actual values are much closer. It is interesting to observe that the 2D analysis lets us detect this behavior which was not apparent from the timeline views.
We can also see how the average L2 miss cost is significantly better in some user functions (1-blue, 3-red column) for the MLP code than for the MPI, but quite similar in others (2-white). The fact that also routine 2 takes more elapsed time in the MPI case shows how the imbalance of L2 miss cost across threads is harmful for the application as the delay accumulates at the synchronization points.

Figure 13: Estimated L2 miss cost for the MLP code
Conclusions
This document describes the process followed to identify the reason for the differences between two versions of a code for which the user was expecting similar performance. The study proceeds from the profiles that typically constitute the first steps of a performance analysis to deep detailed observation of very specific sections of the code. It is an example of how the flexibility in deriving performance indices and obtaining quantitative statistics (specially histograms) can be used to squeeze the is actually available in a single trace. Paraver is the type of tool that allows the analyst to always address the new question that every answer comes with.
In the problem analyzed, the reason for the difference in performance between versions is not related to application structure or actual communication mechanism. The precise profile information that OMPItrace and Paraver support lets the analyst to confirm/discard the possible causes that become suspects as the analysis progresses.
In this case, the typical suspects (Communication libraries performance, computational load imbalance, locality) can be successively discarded, until OS issues such a memory and process placement are identified as cause of the performance differences. Even if no measurement is directly available on the trace about the cost of memory access or the actual placement of processes and memory, this study shows how it is possible to carry out indirect observations that reflect causes of problems beyond the directly traced events. Furthermore, the powerful semantic module of Paraver can process and summarize the actually traced data in a single view that estimates the actual magnitude of those indirectly observed phenomena.




