MPI+OpenMP Performance Analysis tips
This document poses some questions a performance analyst may face when looking at a Paraver trace. We describe how the tool can be used to identify specific behaviors or performance properties.
Paraver is a generic browser of performance data stored in a generic trace format. Different tracing packages are targeted to different programming models or platforms. The actual analysis that can be performed depends on the type of information captured and emitted by the tracing package. This document concentrates on the analysis of traces produced by the OMPItrace package for tracing MPI and OpenMP programs.
Because traces for real programs and platforms can be very large, a set of utilities has been developed to summarize the information. These utilities generate events with some encoding convention. This document also describes how to use and analyze the scalability features supported by such utilities. A separate document is devoted to each of the other encoding conventions used by CEPBA-Tools tracing packages and translators such as LTT, AIXtrace, etc.
All the descriptions in this document assume that you have experience with the navigation features of the Paraver display windows as well as some knowledge of the basic concepts of the filter, semantic, display and 2D/3D analysis modules. If not, please refer to the Paraver navigation and concepts.
The document is structured in two major sections. The first gives methodological guidance on the basic analysis of parallel programs. The second describes how to handle large traces deriving from large platforms and/or program runs.
The following index will lead you directly to the section describing how to answer a relevant question that may arise during a performance analysis. By navigating through the cross reference links you will be able to traverse a broad exploration space in order to better understand the performance of your application.
For each of them you will get a reference to one Paraver configuration that should directly address the issue.
1.1.Global performance and profile
1.1.1.Which are the user routines taking most of the time?
1.1.2.What is the global efficiency of the application?
1.1.3.What is the instantaneous parallelism profile of the application?
1.2.1.Measuring computational load imbalance
1.2.2.Migrating load imbalance in SPMD patterns
1.2.3.Histograms of computation burst
1.3.General Communication Performance
1.3.1.Is MPI taking a significant fraction of the time?
1.4.Point to Point Communication Performance
1.4.1.Are point to point MPI calls performing efficiently from a process point of view?
1.4.2.Is the end-to-end performance of point to point MPI calls appropriate?
1.4.3.What are the message sizes used by point to point calls in the program?
1.4.4.What is the system bandwidth the application uses?
1.5.Collective Communication Performance
1.5.1.Which/how many communicators are used within collective MPI calls?
1.5.2.Are collective MPI calls performing efficiently?
1.6.Hardware counter based metrics
1.7.Can I identify regions perturbed by preemptions in the trace?
1.7.1.Searching for preemptions in user mode code
1.7.2.Searching for preemptions if cycles counter available
1.7.3.Searching for preemptions in MPI code
1.9.1.Is the instrumentation overhead high?
1.9.2.When do processes flush the trace to disk?
2.Handling large systems and runs
2.1.Obtaining the original trace
2.2.1.Density of MPI calls?
2.2.2.From hardware counters to software counters
1.Basic analysis
1.1.Global performance and profile
The first questions when looking at a parallel program are which are the main routines in the application and how much parallelism we are obtaining.
1.1.1.Which are the user routines taking most of the time?
An initial analysis of a trace should be targeted to the identification of the routines of the code where a relevant fraction of the time is spent. At the level of user routines this will point to the regions on which to focus further attention and optimization efforts.
To obtain the typical user functions profile use configuration file: General/analysis/uf_profile.cfg.
WARNING: Applicable only if user function events are available in the trace. Only instrumented functions appear in the table. Such instrumentation may have been automatically inserted (driven by the file pointed to by environment variable MPTRACE_ADD_FUNCTIONS) with dynamic instrumentation tracing packages. For static tracing packages, explicit calls to ompitrace_event (60000019, #routine|0) must have been used.
What you will see: A table with one row for each thread and one column for each user function. The value indicates the time inside such function (exclusive times between the instrumented routines). Dark blue indicates a high value, light green a low value.
Some interpretation capabilities:
Further uses of the configuration file:
1.1.2.What is the global efficiency of the application?
Configuration file General/analysis/avg_procs.cfg can be used to get a single number of the performance of the application.
What you will see: a single entry 2D window reporting the average number of processes performing useful computation out of the total number of processes.
Some interpretation capabilities:
Further uses of the configuration file:
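The single number described in section 1.1.2 can be illustrated with a small sketch. Paraver computes it from the trace; the code below only reproduces the arithmetic. It assumes the useful computation intervals of each process have already been extracted as (process, begin, end) tuples. The function name and the data layout are hypothetical and are not part of Paraver or OMPItrace.

```python
def average_useful_processes(intervals, num_processes, t_start, t_end):
    """Average number of processes computing in [t_start, t_end].

    intervals: list of (process, begin, end) tuples of useful computation
    (hypothetical layout, not the Paraver trace format).
    Returns (average number of processes, fraction of the total).
    """
    useful = 0.0
    for _process, begin, end in intervals:
        # Clip each interval to the analysis window.
        lo, hi = max(begin, t_start), min(end, t_end)
        if hi > lo:
            useful += hi - lo
    avg_procs = useful / (t_end - t_start)
    return avg_procs, avg_procs / num_processes


# Four processes observed over a window of 10 time units.
intervals = [(0, 0, 10), (1, 0, 5), (2, 2, 6), (3, 0, 1)]
avg, efficiency = average_useful_processes(intervals, 4, 0, 10)
print(avg, efficiency)
```

In this made-up example the processes accumulate 20 time units of computation in a window of 10, so on average 2 of the 4 processes are doing useful work.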
1.1.3.What is the instantaneous parallelism profile of the application?
Configuration file General/views/instantaneous_parallelism.cfg displays the total number of processes performing some useful computation at each point in time.
Some interpretation capabilities:
Further uses of the configuration file:
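As a rough illustration of what the instantaneous parallelism view of section 1.1.3 represents, the sketch below builds a step function from computation intervals with a simple sweep over begin/end events. The input layout and the function name are assumptions made for this example; they do not describe how Paraver implements the view.

```python
def parallelism_profile(intervals):
    """Return a step function [(time, active), ...].

    intervals: list of (begin, end) computation intervals, one entry per
    burst of any process (hypothetical layout).
    Each (time, active) pair means 'active' processes compute from 'time'
    until the time of the next pair.
    """
    events = []
    for begin, end in intervals:
        events.append((begin, 1))   # a process starts computing
        events.append((end, -1))    # a process stops computing
    events.sort()

    profile = []
    active = 0
    for time, delta in events:
        active += delta
        if profile and profile[-1][0] == time:
            # Several events at the same instant: keep only the final count.
            profile[-1] = (time, active)
        else:
            profile.append((time, active))
    return profile


print(parallelism_profile([(0, 4), (1, 3), (2, 4)]))
```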
1.2.Load balance
A first factor that will determine the performance of a parallel application is the balance of the load between processes. The following questions/descriptions can help an analyst identify the nature of such load distribution by looking at the global and local distribution of the computational chunks between communications.
1.2.1.Measuring computational load imbalance
To determine whether different processes do compute for different amounts of time inside one routine use configuration file General/analysis/uf_excl_MPI_profile.cfg.
WARNING: Applicable only if user function events available in the trace. Only functions instrumented appear in the table.
What you will see: A table with one row for each thread and one column for each user function. The value indicates total user level computation time within such function (excluding other traced functions it may call). This profile is similar to the one in section 1.1.1, but without including in the statistics the time in MPI.
Some interpretation capabilities:
Related approaches:
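One possible way to turn the table of section 1.2.1 into a single indicator per routine is to compare the average computation time across threads with the maximum. This ratio is an assumption of this sketch, not a metric defined by the configuration file: a value of 1.0 means all threads compute for the same time, and lower values mean more imbalance. The table contents and names below are made up.

```python
def balance_per_function(table):
    """Return {function: average time / maximum time} across threads.

    table: {thread: {function: computation time}} (hypothetical layout
    mirroring the rows and columns of the profile table).
    """
    functions = sorted({f for row in table.values() for f in row})
    result = {}
    for function in functions:
        # A thread that never ran the function contributes 0 time.
        times = [row.get(function, 0.0) for row in table.values()]
        peak = max(times)
        mean = sum(times) / len(times)
        result[function] = mean / peak if peak > 0 else 1.0
    return result


table = {
    "thread 1": {"solve": 8.0, "exchange": 2.0},
    "thread 2": {"solve": 4.0, "exchange": 2.0},
}
balance = balance_per_function(table)
print(balance)
```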
1.2.2.Migrating load imbalance in SPMD patterns
An SPMD iterative program may show a type of load imbalance where at each iteration a different process has more load than the others, while at the global level the load is still fairly balanced. To try to identify this situation for a specific user function you can use configuration file General/analysis/load_balance_for_specific_uf.cfg.
WARNING: Applicable only if user function events are available in the trace. Only instrumented functions appear in the table.
WARNING: Assumes an SPMD structure where all processes enter the user routine at about the same time. Skews in the invocation of a routine across processes will actually be considered as load imbalance.
What you will see: A table with two columns. Column 1 represents the specific user routine, while column 0 displays the same metric for the rest of the program. Each entry computes the percentage of time the thread is active within the total span of the parallel function.
To select a specific user function of interest you have to change the value in the “In Stacked Val Parameter” selector of the semantic module of view “Some process in specific user function”. The value should be the identifier of the routine of interest.
Some interpretation capabilities:
1.2.3.Histograms of computation burst
Configuration file General/analysis/3dh_ufduration.cfg can be used to obtain a very detailed
WARNING: Applicable only if user function events are available in the trace. Only instrumented functions appear in the table.
What you will see: A table showing a histogram of the duration of the different CPU bursts between successive events. Typically this will be between entries/exits of user functions and between these and the MPI calls.
The histogram you will see corresponds to one specific user function. To change the user function, use the selector at the bottom right of the 2D window.
Some interpretation capabilities:
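The idea behind the burst duration histogram of section 1.2.3 can be sketched as follows. The sketch assumes the CPU bursts of one thread inside one user function are already available as (begin, end) pairs, and uses fixed-width bins with the last bin collecting everything longer. Both the layout and the binning policy are assumptions of this example.

```python
def burst_histogram(bursts, bin_width, num_bins):
    """Count CPU bursts per duration range.

    bursts: list of (begin, end) times of computation bursts
    (hypothetical layout). Durations beyond the last bin are
    accumulated in the last bin.
    """
    counts = [0] * num_bins
    for begin, end in bursts:
        duration = end - begin
        index = min(int(duration // bin_width), num_bins - 1)
        counts[index] += 1
    return counts


# Bursts of 5, 12, 18, 95 and 250 time units, bins of 50 units.
bursts = [(0, 5), (5, 17), (17, 35), (35, 130), (130, 380)]
histogram = burst_histogram(bursts, 50, 4)
print(histogram)
```

A histogram with several well separated columns suggests bursts of clearly different granularity inside the same routine.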
1.3.General Communication Performance
Message passing and global operations are needed to achieve the cooperation of the different processes in a parallel program, but they are frequently the first thing to be blamed in case of poor performance. The following questions and descriptions should help an analyst to properly identify to what extent communication is a real bottleneck for the application performance.
1.3.1.Is MPI taking a significant fraction of the time?
To identify potential problems due to communication and synchronization overhead use: mpi/analysis/mpi_stats.cfg.
What you will see: A table with one row for each thread and one column for each MPI routine used by the program. The value indicates the percentage of the time inside such MPI call. Additionally, column 0 (End) indicates the fraction of time in user level code. Dark blue indicates a high value, light green a low value. Changing the gradient scale representation may highlight differences between processes (load imbalance reflected in MPI calls).
Some interpretation capabilities:
Waits for message reception due to imbalances or externally caused delays (i.e. preemptions) that propagate through the communication dependence chain.
Further uses of the configuration file:
Directions for further investigation:
1.4.Point to Point Communication Performance
1.4.1.Are point to point MPI calls performing efficiently from a process point of view?
Even if an MPI routine is taking a lot of time, the question is whether the routine is behaving as expected performance wise. As developers/users, we may be ready to accept the overhead of MPI calls as long as the service we obtain from them falls within the conceptual performance model we have of it. For example, we expect isends to take a minimal amount of time, we may accept sends to take a time inversely proportional to the nominal bandwidth of the system, …
To assess the local performance of an MPI point to point call use configuration file mpi/analysis/3dh_bw_per_call.cfg.
What you will see: A histogram with one row for each thread and one column (bin) for each range of “local cost” of the point to point MPI call. By “local cost” we mean the ratio of microseconds per byte sent/received by the MPI call. This “cost” does not consider the transfer time for that data. It should be seen as a relative measurement of the overhead the call introduced in the sequential execution of the program that called it. The value in an entry indicates the percentage of the time at the corresponding local cost range (computed over the total time inside the MPI call).
The histogram you see corresponds to a specific MPI call. Change the selector at the bottom right of the window to analyze the behavior of the MPI call you are interested in.
Some interpretation capabilities: In general, the histogram should be useful to identify MPI call invocations with different types of behaviour as well as outliers. By clicking the “Open Control Window Zoom 2D” and selecting a range of interest within the 2D histogram table you will generate a new display window where only the calls of the selected local cost will appear. You may thus look at calls with expected behavior, as well as at regions with strange (i.e. very poor) behavior. By correlating the scales of these selective views to the user function or MPI call views you may identify where in the source code the problem is or which communication pattern results in the obtained behavior.
The following bullets detail the expected behavior for different MPI calls:
Further uses of the configuration file:
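The “local cost” histogram of section 1.4.1 can be illustrated with the sketch below for one thread and one MPI call. It assumes each invocation is available as a (duration in microseconds, bytes) pair and uses fixed-width cost bins with the last bin collecting all higher costs. The names, the layout and the binning are assumptions of this example, not the internals of the configuration file.

```python
def local_cost_histogram(calls, bin_width, num_bins):
    """Percentage of MPI call time spent in each local cost range.

    calls: list of (duration_us, size_bytes) pairs for one thread and
    one MPI point to point call (hypothetical layout).
    Local cost is microseconds per byte; each call adds its duration
    to the bin of its cost, and bins are normalized by the total time.
    """
    time_per_bin = [0.0] * num_bins
    total = 0.0
    for duration_us, size_bytes in calls:
        if size_bytes <= 0:
            continue  # cost is undefined for empty messages
        cost = duration_us / size_bytes
        index = min(int(cost // bin_width), num_bins - 1)
        time_per_bin[index] += duration_us
        total += duration_us
    return [100.0 * t / total if total else 0.0 for t in time_per_bin]


# Two cheap calls (0.01 and 0.02 us/byte) and one expensive outlier (0.7 us/byte).
calls = [(10, 1000), (20, 1000), (70, 100)]
percentages = local_cost_histogram(calls, 0.25, 4)
print(percentages)
```

In this made-up data the single outlier accounts for 70% of the time inside the call, which is the kind of pattern the histogram is meant to expose.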
1.4.2.Is the end-to-end performance of point to point MPI calls appropriate?
Configuration file mpi/views/point2point/s_r_bandwidths.cfg displays the actual amount of communication bandwidth used by each process in the application (one view for the send bandwidth, one for the receive bandwidth and one for the sum).
What you will see: Timelines reporting for each thread at each point in time the equivalent bandwidth (incoming/outgoing/total) of all point to point message transfers the thread is involved in (you may have to open the incoming and outgoing views from within the “Visualizer Module” window). For each message, the equivalent bandwidth it contributes during the duration of the transfer is equal to the ratio between the message size and the duration of such transfer. Every message contributes this value both to the sender thread on the “Send Bandwidth” view and to the receiver thread on the “Recv Bandwidth” view. The “Process Bandwidth” view is the point wise addition of the other two.
Some interpretation capabilities:
Further uses of the configuration file:
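The equivalent bandwidth definition of section 1.4.2 can be written down as a short sketch. It assumes each message is described by a (sender, receiver, size, send time, receive time) tuple and evaluates the three views for one thread at one instant. The tuple layout and the function name are hypothetical.

```python
def bandwidths_at(messages, thread, time):
    """Equivalent (send, recv, total) bandwidth of a thread at a given time.

    messages: list of (sender, receiver, size_bytes, t_send, t_recv)
    tuples (hypothetical layout). While a transfer is in flight it
    contributes size / duration to its sender and to its receiver.
    """
    send = 0.0
    recv = 0.0
    for sender, receiver, size_bytes, t_send, t_recv in messages:
        if not (t_send <= time < t_recv):
            continue  # transfer not in flight at this instant
        bandwidth = size_bytes / (t_recv - t_send)
        if sender == thread:
            send += bandwidth
        if receiver == thread:
            recv += bandwidth
    return send, recv, send + recv


messages = [
    (0, 1, 1000, 0.0, 10.0),  # thread 0 -> thread 1
    (1, 0, 500, 5.0, 10.0),   # thread 1 -> thread 0
    (0, 2, 200, 0.0, 4.0),    # thread 0 -> thread 2
]
print(bandwidths_at(messages, 0, 2.0))
print(bandwidths_at(messages, 0, 6.0))
```

With times in microseconds and sizes in bytes the result is in bytes per microsecond, that is, MB/s.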
1.4.3.What are the message sizes used by point to point calls in the program?
A typical concern when an MPI program does not scale is that it may be using many small messages. For many other reasons we may also be interested in finding which message sizes are used by the application. Configuration file cfgs/mpi/analysis/point2point/3dh_msgsize_per_pt2pt_call.cfg can be used to compute this.
What you will see: A histogram with one row for each thread and one column (bin) for each range of message sizes of the point to point MPI calls. There is actually one such histogram for each MPI point to point call. The selector “Fixed Value” at the bottom of the 3D window can be used to select the desired MPI call. The value in an entry of the table indicates the total number of messages within that range of sizes sent/received by the selected MPI call.
If a program uses very large message sizes, the bins of the histogram will be very large, and several message sizes may fall in one such bin. If you are interested in differentiating message sizes in a small range you may use the 2D zoom capability.
Applications tend not to use many different message sizes, so you can expect to see a few vertical stripes in the 2D table. You can use the “Hide null entries” button of the 2D window and small bin ranges (delta) to get a table with just one column for each one of the message sizes used by the program.
Some interpretation capabilities:
Further uses of the configuration file:
Directions for further investigation:
1.4.4.What is the system bandwidth the application uses?
Configuration file mpi/views/point2point/total_bw.cfg can be used to visualize the instantaneous amount of communication bandwidth used by point to point calls in the application.
What you will see: A timeline computed by summing at each point in time the equivalent bandwidth of all point to point message transfers taking place at that time. Each point to point transfer thus contributes to such function, from the moment the sender invokes the send primitive till the receiver gets the message, with a magnitude equal to the ratio between the message size and the duration of such transfer.
Some interpretation capabilities:
Directions for further investigation:
1.5.Collective Communication Performance
1.5.1.Which/how many communicators are used within collective MPI calls?
Some applications only use the COMM_WORLD communicator. Other applications create and use new communicators. For an unknown application the question often arises whether it is using one or several communicators. In the latter case, visualizing which communicators are being used by which group of processes and when is very useful to get a good feeling of the application structure. Configuration file mpi/views/collectives/communicator.cfg should be used to address this issue.
What you will see: a timeline that shows the intervals when each process is within a collective MPI call. For each interval, the color represents the identifier of the communicator used by the call.
Related approaches:
1.5.2.Are collective MPI calls performing efficiently?
Configuration file mpi/views/collectives/collective_bandwidth.cfg can be used to get a good feeling of this issue.
What you will see: the configuration file displays a timeline of the ratio between the size of the data involved in the collective for a process and the time the process has been in the collective. It is a local measure and does not necessarily measure the actual communication bandwidth used by the collective implementation. It is nevertheless a good view to compare different instances of a given MPI collective call.
Directions for further investigation: Some issues to investigate are:
1.6.Hardware counter based metrics
Hardware counter events can be emitted to the trace on entry and exit of user functions and MPI calls (plus direct source code invocation of the trace_event API). From these hardware counter events, a number of direct and derived metrics can be computed.
Given that the actual hardware counters captured by instrumentation packages are very platform specific, different sets of configuration files have to be provided for each platform. Even for one platform, many combinations of events can be instrumented. As a starting point for novice users we provide a set of views that can serve as a basic reference. The views may be directly obtained from a single hardware counter or be derived metrics combining several of them. Each view represents a time varying function, typically color encoded.
For each platform or event set, the views are classified in four major groups:
Specific Hardware Performance Counters: Intel Platforms
On many platforms, one is limited in the set of hardware counters that can be read at the same time, or in the number of counters available. For this reason, one may have to do several mpitrace runs to get a better picture of the related hardware counters. E.g. the family of ia32 processors is quite diverse with respect to the number of hardware counters that may be read out and the overall width per counter, changing from the standard Pentium (P5) to Pentium II/III (P6) to the Pentium4/Xeon currently in use.
Through PAPI, which uses perfctr to read out hardware performance counters on the Intel (ia32, ia32_64 and ia64), Athlon (K7 and Opteron) and PowerPC groups of processors, mpitrace has a convenient way to access hardware performance counters. The following counters may be selected through the MPTRACE_COUNTER-environment flag:
Specific Hardware Performance Counters: NEC SX Platforms
On the NEC SX-8 Vector Systems, one has access to the following counters, selectable through the MPTRACE_COUNTER-environment flag:
EX (execution counter): The execution counter (EX) is 52 bits long and is incremented by one every time a vector or scalar instruction is executed. When EX overflows, it is reset and starts counting from zero again.
VX (vector execution counter): The 48-bit vector execution counter (VX) is incremented by one every time a vector instruction is executed. When VX overflows, it is reset and starts counting from zero again.
VE (vector element counter): The vector element counter (VE) is 56 bits long and counts the vector
Barcelona Supercomputing Center, 2010