Analyzing the influence of interrupts and system daemons on fine grain applications

Introduction

The objective of this study was to use the detailed scheduling information provided by the IBM AIX trace facility, in combination with the analysis power of Paraver, to analyze the influence and impact of system interrupts on fine grain parallel applications.

Under a collaboration contract with Terry Jones from LLNL, we are developing a translator from the AIXtrace format to Paraver. The objective of this project is to use the flexibility and potential of Paraver to analyze the low-level, detailed information captured by the AIX trace facility.

To study the system interference, we wrote a very simple example that loops doing a short computation and an MPI barrier. The tracefile was obtained on a p630 node (Power4, 4 processors) of our IBM SP cluster. The trace includes information on all the system and user processes present while our test application ran (339 processes). It covers the duration of the test run (approximately 1.2 seconds).

The initial view displays all the processes.

![Initial view displaying all the processes]()

We can see some daemons that are in the yield state most of the time (green), some system processes in the blocked state (red), and many processes that are stopped most of the time (white). We can identify our application processes close to the bottom of the display. They compute for a short while (dark blue). The red part corresponds to the execution of the MPI_Finalize routine.

Distribution of the CPU time

Using one of the provided configuration views, we can perform a 2D analysis that displays a matrix of the use of the different CPUs by the different processes.

![2D analysis of CPU use by process]()

Each row corresponds to a processor and each column to a process. The color of a cell represents the total time spent by the given process on the given CPU. Yellow cells mark values smaller than 300us, and orange cells values greater than 30ms.
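
As a rough illustration of what the CPU-by-process matrix contains, the following sketch sums running time per (CPU, process) cell and buckets each cell with the 300us and 30ms thresholds quoted above. The record layout, the process names, and the "intermediate" bucket are assumptions made for this example; this is not the Paraver trace format or its implementation.

```python
from collections import defaultdict

# Hypothetical burst records: (cpu, process, start_us, end_us).
# This layout is an assumption for illustration, not the Paraver trace format.
bursts = [
    (0, "app.1", 0.0, 40_000.0),
    (1, "daemonA", 100.0, 121.0),
    (2, "daemonA", 46_500.0, 46_520.0),
    (1, "app.2", 200.0, 35_000.0),
]


def cpu_time_matrix(records):
    """Sum the running time of each (cpu, process) cell, in microseconds."""
    matrix = defaultdict(float)
    for cpu, process, start_us, end_us in records:
        matrix[(cpu, process)] += end_us - start_us
    return dict(matrix)


def color(total_us, low_us=300.0, high_us=30_000.0):
    """Bucket a cell with the thresholds quoted in the text.

    The "intermediate" label is an assumption; the text only names
    the two extremes of the color scale.
    """
    if total_us < low_us:
        return "yellow"
    if total_us > high_us:
        return "orange"
    return "intermediate"


matrix = cpu_time_matrix(bursts)
for (cpu, process), total in sorted(matrix.items()):
    print(f"CPU {cpu} {process:8s} {total:10.1f} us  {color(total)}")
```

A daemon that appears with similar small totals in several rows of such a matrix is one that the scheduler spreads across CPUs.
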
The CPU-time matrix shows that certain processes get a very small amount of time and are scheduled without much spatial locality (they run for similar fractions of time on different processors). Computing the average duration of the CPU bursts assigned to them by the OS (by changing the Statistic field to Average Burst Time), we get the following view.

![Average burst time per process and CPU]()

Typical values for many of those runs are a few tens of microseconds (yellow cells mark average bursts smaller than 30us, orange cells bursts greater than 900us). If we change the Statistic to #Burst!=0, we can compute the number of times each process was dispatched to each CPU.

![Number of bursts per process and CPU]()

With this view we can see that, although most of the processes run on at least 3 of the 4 processors, some of them were dispatched to one of the CPUs a higher percentage of the time.

Looking for the system daemons

Our interest is to visualize when the processes that execute for a few microseconds are run. To do that, we manually selected a set of processes that, in the "state as is" view (the default view opened after loading a trace), were in the YIELD state most of the time (shown in a shade of green). The following view shows a typical pattern for these processes.

![Typical pattern of the selected daemon processes]()

We can identify some periodicity in the distribution of the flags. These flags appear when the processes wake up and execute for a short time. We can use the 2D-analysis module to measure the duration of the bursts in the different states.

![Burst duration in the different states]()

We can see that all the selected daemons have more or less the same behavior. The standard deviation of the columns is small enough to say that the daemons are basically in the yield state for periods of around 46.4ms and, between them, run for about 21us. On average, the ready bursts of these processes lasted 10.6us. We can do the same analysis from the point of view of the CPUs by creating a view that masks all the processes except our selected daemons.
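
The masking step just described can be pictured as a filter followed by a per-CPU average. The sketch below is a minimal illustration under assumed inputs: the record layout, the daemon names, and the durations are all hypothetical, and the code does not reproduce how Paraver builds the view.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (cpu, process, state, duration_us).
# Names and values are made up for illustration.
records = [
    (0, "daemonA", "running", 21.0),
    (1, "daemonA", "running", 19.0),
    (1, "daemonB", "running", 25.0),
    (1, "daemonB", "running", 16.0),
    (2, "app.1", "running", 5000.0),
    (0, "daemonA", "ready", 10.0),
]
selected = {"daemonA", "daemonB"}


def avg_running_burst_per_cpu(records, selected):
    """Mask every process outside `selected`, then average the
    running-burst duration seen on each CPU."""
    per_cpu = defaultdict(list)
    for cpu, process, state, duration_us in records:
        if process in selected and state == "running":
            per_cpu[cpu].append(duration_us)
    return {cpu: mean(durations) for cpu, durations in per_cpu.items()}


result = avg_running_burst_per_cpu(records, selected)
for cpu, avg_us in sorted(result.items()):
    print(f"CPU {cpu}: average daemon burst {avg_us:.1f} us")
```

CPUs on which none of the selected daemons ran simply do not appear in the result.
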
![CPU view of the selected daemons]()

We can see that all the daemons run on many of the CPUs in chunks of approximately the same duration (between 16 and 25.5us).

Analyzing the impact on the test program

This raises a concern about the effect of these preemptions on parallel applications. If the application has fine granularity (frequent communication, collectives...), as is the case in our example, the effect can be important, because busy waiting takes place on the other processes while one is preempted. To study this effect, we inserted user events into the source code of our test to mark which part of the code is executing (blue is barrier, dark red is computing, and light red is loop overhead or outside the loop). Displaying a view with these events and zooming into the area with the event flags, we start to identify some barriers that last longer.

![User events marking barrier, computing and loop overhead]()

To confirm this impression, we can load a new view that displays the duration of the MPI_Barrier call. If we zoom in until we capture an area of a few barriers, we see something similar to the next figure.

![Duration of the MPI_Barrier calls]()

This view confirms that some barriers take longer than the average. Clicking on the different areas, we can measure the increase in duration: most of the barriers last around 14us, while the longer ones took 66 and 93us (more than 4 and 6 times the typical duration!). More importantly, we can also see that the long barriers share the same pattern: 3 MPI tasks wait until the 4th task finishes the computation and enters the barrier. We know that all the tasks execute the same code, so we can suspect that interruptions by system processes cause this delay. If we synchronize the threads mapping view with the barrier duration zoom, we can see whether the delay of the task was caused by some system interference.

![Threads mapping view synchronized with the barrier duration view]()

We can confirm that each time one of the tasks is delayed, it is because it was preempted during the computation loop.
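
The two checks made above by clicking in the views (which barriers are unusually long, and whether a task was preempted while computing) can be sketched as follows. Only the 14, 66 and 93us values come from the text. The sequence of barriers, the 3x-median threshold, and the interval values are assumptions chosen for illustration.

```python
from statistics import median

# Barrier durations in microseconds. The values 14, 66 and 93 come from
# the text; their order and count are hypothetical.
barrier_us = [14.0, 14.0, 66.0, 14.0, 14.0, 93.0, 14.0]


def long_barriers(durations, factor=3.0):
    """Return (index, duration, slowdown) for every barrier longer than
    `factor` times the median duration. The factor is an assumed cutoff."""
    typical = median(durations)
    return [
        (index, duration, duration / typical)
        for index, duration in enumerate(durations)
        if duration > factor * typical
    ]


def preempted_during(compute_interval, preemptions):
    """True if any (start, end) preemption interval overlaps the
    (start, end) computation interval of a task."""
    start, end = compute_interval
    return any(p_start < end and p_end > start for p_start, p_end in preemptions)


for index, duration, slowdown in long_barriers(barrier_us):
    print(f"barrier {index}: {duration:.0f} us, {slowdown:.1f}x the median")

# Hypothetical intervals: a task computing from 100 to 160 us was
# preempted by a daemon between 120 and 145 us.
print(preempted_during((100.0, 160.0), [(120.0, 145.0)]))
```

Using the median rather than the mean keeps the few long barriers from inflating the reference duration they are compared against.
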
Using the All window option of the 2D-analysis, we can measure, only for this part of the trace, the time that each process was running on each CPU.

![Time each process ran on each CPU in the selected window]()

We can see that the 4 MPI tasks of our program consumed most of the time, while each daemon process ran for only 24-26us. Nevertheless, as we have seen, these small interruptions result in a significant increase in the duration of the barrier calls compared to the situations where there is no preemption.

Conclusions

This page describes an example of the types of analyses that can be performed on traces collected with AIXtrace from IBM. This tracing tool captures a large amount of very detailed information on process activity, context switches, system calls, etc. from SMP nodes running AIX. It is quite typical for users of AIXtrace to process only a minimal part of the information captured, as the textual format of the browsing tools provided with AIXtrace does not give general views and makes detailed analysis cumbersome.

To leverage the analysis power of Paraver on AIX traces, a translator was developed (with support from LLNL) to encode in the Paraver format the information contained in an AIX trace. This development demonstrated the generality of the Paraver trace format.

Paraver now exposes the global behavior of the system in single views. It is also possible to analyze the detailed behavior of very small time intervals. The extremely powerful performance data processing capabilities of Paraver can now be used to analyze the very detailed information captured by AIXtrace.

As an example, we have analyzed in this page the influence of preemptions and system daemon activity on a fine grain parallel application. Timelines of process-to-processor mappings, as well as detailed computed statistics, help us understand the OS scheduling mechanism, clock interrupt frequency and skew across CPUs, daemon characteristics in terms of CPU demand each time they are activated, etc.
The correlation of OS activity with internal events in an application lets us observe the effect of those preemptions and system daemons on application performance (especially waits at synchronization points). These types of effects are often the cause of poor scalability in some global operations.

Barcelona Supercomputing Center, 2010