Performance properties

Time

Description:
Total time spent for program execution, including the idle times of CPUs reserved for slave threads during OpenMP sequential execution. This pattern assumes that every thread of a process was allocated a separate CPU during the entire runtime of the process.
Unit:
Seconds
Diagnosis:
Expand the metric tree hierarchy to break down total time into its constituent parts. This helps to determine how much of it is due to local/serial computation versus MPI and/or OpenMP parallelization costs, and how much of that time is wasted waiting for other processes or threads due to load imbalance or insufficient parallelism.

Expand the call tree to identify important callpaths and routines where most time is spent, and examine the times for each process or thread to locate load imbalance.
Parent:
None
Children:
Execution Time, Overhead Time, OpenMP Idle threads Time

Visits

Description:
Number of times a call path has been visited. Visit counts for MPI routine call paths directly relate to the number of MPI Communications and Synchronizations. Visit counts for OpenMP operations and parallel regions (loops) directly relate to the number of times they were executed. Routines which were not instrumented, or were filtered during measurement, do not appear on recorded call paths. Similarly, routines are not shown if the compiler optimizer successfully in-lined them prior to automatic instrumentation.
Unit:
Counts
Diagnosis:
Call paths that are frequently visited (and thereby have high exclusive Visit counts) can be expected to have an important role in application execution performance (e.g., Execution Time).

Very frequently executed routines, which are relatively short and quick to execute, may have an adverse impact on measurement quality. This can be due to instrumentation preventing in-lining and other compiler optimizations and/or overheads associated with measurement such as reading timers and hardware counters on routine entry and exit. When such routines consist solely of local/sequential computation (i.e., neither communication nor synchronization), they should be eliminated to improve the quality of the parallel measurement and analysis.

One approach is to specify the names of such routines in a filter file for subsequent measurements to ignore, and thereby considerably reduce their measurement impact. Alternatively, selective instrumentation can be employed to entirely avoid instrumenting such routines and thereby remove all measurement impact. In both cases, uninstrumented and filtered routines will not appear in the measurement and analysis, much as if they had been "in-lined" into their calling routine.
Parent:
None
Children:
None

Execution Time

Description:
Time spent on program execution, excluding the idle times of slave threads during OpenMP sequential execution.
Unit:
Seconds
Diagnosis:
Expand the call tree to determine important callpaths and routines where most exclusive execution time is spent, and examine the time for each process or thread on those callpaths looking for significant variations which might indicate the origin of load imbalance.

Where exclusive execution time on each process/thread is unexpectedly slow, profiling with PAPI preset or platform-specific hardware counters may help to understand the origin. Serial program profiling tools (e.g., gprof) may also be helpful. Generally, compiler optimization flags and optimized libraries should be investigated to improve serial performance, and where necessary alternative algorithms employed.
Parent:
Time
Children:
MPI Time, OpenMP Time

Overhead Time

Description:
Time spent performing major tasks related to measurement, such as creation of the experiment archive directory, clock synchronization, or dumping trace buffer contents to a file. Note that normal per-event overheads – such as event acquisition, reading timers and hardware counters, runtime call-path summarization and storage in trace buffers – are not included.
Unit:
Seconds
Diagnosis:
Significant measurement overheads are typically incurred when measurement is initialized (e.g., in the program main routine or MPI_Init) and finalized (e.g., in MPI_Finalize), and are generally unavoidable. While they extend the total (wallclock) time for measurement, when they occur before parallel execution starts or after it completes, the quality of measurement of the parallel execution is not degraded. Trace file writing overhead time can be kept to a minimum by specifying an efficient parallel filesystem (when provided) for the experiment archive (e.g., EPK_GDIR=/work/mydir) and not specifying a different location for intermediate files (i.e., EPK_LDIR=$EPK_GDIR).

When measurement overhead is reported for other call paths, especially during parallel execution, measurement perturbation is considerable and interpretation of the resulting analysis is much more difficult. A common cause of measurement overhead during parallel execution is the flushing of full trace buffers to disk: warnings issued by the EPIK measurement system indicate when this occurs. When flushing occurs simultaneously for all processes and threads, the associated perturbation is localized. More usually, buffer filling and flushing occurs independently at different times on each process/thread and the resulting perturbation is extremely disruptive, often forming a catastrophic chain reaction. It is highly advisable to avoid intermediate trace flushes by appropriate instrumentation and measurement configuration, such as specifying a filter file listing purely computational routines (classified as type USR by cube3_score -r) or an adequate trace buffer size (ELG_BUFFER_SIZE larger than max_tbc reported by cube3_score). If the maximum trace buffer capacity requirement remains too large for a full-size measurement, it may be necessary to configure the subject application with a smaller problem size or to perform fewer iterations/timesteps to shorten the measurement (and thereby reduce the size of the trace).
Parent:
Time
Children:
None

MPI Time

Description:
This pattern refers to the time spent in (instrumented) MPI calls.
Unit:
Seconds
Diagnosis:
Expand the metric tree to determine which classes of MPI operation contribute the most time. Typically the remaining (exclusive) MPI Time, corresponding to instrumented MPI routines that are not in one of the child classes, will be negligible. There can, however, be significant time in collective operations such as MPI_Comm_create, MPI_Comm_free and MPI_Cart_create that are considered neither explicit synchronization nor communication, but result in implicit barrier synchronization of participating processes. Avoidable waiting time for these operations will be reduced if all processes execute them simultaneously. If these are repeated operations, e.g., in a loop, it is worth investigating whether their frequency can be reduced by re-use.
Parent:
Execution Time
Children:
MPI Synchronization Time, MPI Communication Time, MPI File I/O Time, MPI Init/Exit Time

MPI Synchronization Time

Description:
This pattern refers to the time spent in MPI explicit synchronization calls, i.e., barriers. Time spent in zero-sized point-to-point messages used for coordination is currently part of MPI Point-to-point Communication Time.
Unit:
Seconds
Diagnosis:
Expand the metric tree further to determine the proportion of time in different classes of MPI synchronization operations. Expand the calltree to identify which callpaths are responsible for the most synchronization time. Also examine the distribution of synchronization time on each participating process for indication of load imbalance in preceding code.
Parent:
MPI Time
Children:
MPI Collective Synchronization Time, Remote Memory Access Synchronization Time

MPI Communication Time

Description:
This pattern refers to the time spent in MPI communication calls.
Unit:
Seconds
Diagnosis:
Expand the metric tree further to determine the proportion of time in different classes of MPI communication operations. Expand the calltree to identify which callpaths are responsible for the most communication time. Also examine the distribution of communication time on each participating process for indication of communication imbalance or load imbalance in preceding code.
Parent:
MPI Time
Children:
MPI Point-to-point Communication Time, MPI Collective Communication Time, Remote Memory Access Communication Time

MPI File I/O Time

Description:
This pattern refers to the time spent in MPI file I/O calls.
Unit:
Seconds
Diagnosis:
Expand the metric tree further to determine the proportion of time in different classes of MPI file I/O operations. Expand the calltree to identify which callpaths are responsible for the most file I/O time. Also examine the distribution of MPI file I/O time on each process for indication of load imbalance. Use a parallel filesystem (such as /work) when possible, and check that appropriate hints values have been associated with the MPI_Info object of MPI files.

Exclusive MPI file I/O time relates to individual (non-collective) operations. When multiple processes read and write to files, MPI collective file reads and writes can be more efficient.
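
As an illustration of the hints and collective I/O advice above, the following sketch attaches standard MPI-IO hints to an MPI_Info object and performs a collective write. The file path, hint values, and data layout are assumptions chosen for illustration only:

  #include <mpi.h>

  /* Sketch: attach filesystem hints and use a collective write.
   * Hint keys are reserved MPI-IO keys; whether and how they are honoured
   * depends on the MPI implementation and the underlying filesystem. */
  void write_block(MPI_Comm comm, double *buf, int count)
  {
      MPI_Info info;
      MPI_File fh;
      MPI_Offset offset;
      int rank;

      MPI_Comm_rank(comm, &rank);

      MPI_Info_create(&info);
      MPI_Info_set(info, "striping_factor", "16");        /* assumed stripe count */
      MPI_Info_set(info, "collective_buffering", "true"); /* enable two-phase I/O */

      MPI_File_open(comm, "/work/mydir/output.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

      /* Each rank writes its own contiguous block; the _all variant makes the
       * write collective so the MPI library can aggregate requests. */
      offset = (MPI_Offset)rank * count * sizeof(double);
      MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                            MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      MPI_Info_free(&info);
  }

Which hints are honoured, and whether they actually help, depends on the MPI implementation and the underlying filesystem.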
Parent:
MPI Time
Children:
MPI Collective File I/O Time

MPI Collective File I/O Time

Description:
This pattern refers to the time spent in collective MPI file I/O calls.
Unit:
Seconds
Diagnosis:
Expand the calltree to identify which callpaths are responsible for the most collective file I/O time. Examine the distribution of times on each participating process for indication of imbalance in the operation itself or in preceding code. Examine the number of MPI File Collective Operations done by each process as a possible origin of imbalance. Where asynchrony or imbalance prevents effective use of collective file I/O, (non-collective) individual file I/O may be preferable.
Parent:
MPI File I/O Time
Children:
None

MPI Init/Exit Time

Description:
Time spent in MPI initialization and finalization calls, i.e., MPI_Init or MPI_Init_thread and MPI_Finalize.
Unit:
Seconds
Diagnosis:
These are unavoidable one-off costs for MPI parallel programs, which can be expected to increase for larger numbers of processes. Some applications may not use all of the processes provided (or not use some of them for the entire execution), such that unused and wasted processes wait in MPI_Finalize for the others to finish. If the proportion of time in these calls is significant, it is probably more effective to use a smaller number of processes (or a larger amount of computation).
Parent:
MPI Time
Children:
None

MPI Collective Synchronization Time

Description:
This pattern refers to the total time spent in MPI barriers.
Unit:
Seconds
Diagnosis:
When the time for MPI explicit barrier synchronization is significant, expand the call tree to determine which MPI_Barrier calls are responsible, and compare with their Visits count to see how frequently they were executed. Barrier synchronizations which are not necessary for correctness should be removed. It may also be appropriate to use a communicator containing fewer processes, or a number of point-to-point messages for coordination instead. Also examine the distribution of time on each participating process for indication of load imbalance in preceding code.

Automatic trace analysis can be employed to quantify time wasted due to Wait at Barrier Time at entry and Barrier Completion Time.
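
To illustrate the suggestion above of using a communicator containing fewer processes, the following sketch synchronizes only a subset of ranks via a sub-communicator; the grouping into even and odd ranks is purely illustrative:

  #include <mpi.h>

  /* Sketch: barrier only among the ranks that actually need to synchronize,
   * using a sub-communicator instead of MPI_COMM_WORLD. */
  void barrier_subset(MPI_Comm comm)
  {
      MPI_Comm sub;
      int rank;

      MPI_Comm_rank(comm, &rank);
      MPI_Comm_split(comm, rank % 2, rank, &sub);  /* two groups: even and odd ranks */

      MPI_Barrier(sub);                            /* involves only the sub-group */

      MPI_Comm_free(&sub);
  }

Note that creating communicators is itself a collective operation, so sub-communicators should be created once and reused rather than recreated around every barrier.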
Parent:
MPI Synchronization Time
Children:
Wait at Barrier Time, Barrier Completion Time

Wait at Barrier Time

Description:
This pattern covers the time spent waiting in front of an MPI barrier, which is the time inside the barrier call until the last process has reached the barrier.


Wait at Barrier Example

Unit:
Seconds
Diagnosis:
A large amount of waiting time at barriers can be an indication of load imbalance. Examine the waiting times for each process and try to distribute the preceding computation from processes with the shortest waiting times to those with the longest waiting times.
Parent:
MPI Collective Synchronization Time
Children:
None

Barrier Completion Time

Description:
This pattern refers to the time spent in MPI barriers after the first process has left the operation.


Barrier Completion Example

Unit:
Seconds
Diagnosis:
Generally all processes can be expected to leave MPI barriers simultaneously, and any significant barrier completion time may indicate an inefficient MPI implementation or interference from other processes running on the same compute resources.
Parent:
MPI Collective Synchronization Time
Children:
None

MPI Point-to-point Communication Time

Description:
This pattern refers to the total time spent in MPI point-to-point communication calls. Note that this is only the respective times for the sending and receiving calls, and not message transmission time.
Unit:
Seconds
Diagnosis:
Investigate whether communication time is commensurate with the number of Communications and Bytes Transferred. Consider replacing blocking communication with non-blocking communication that can potentially be overlapped with computation, or using persistent communication to amortize message setup costs for common transfers. Also consider the mapping of processes onto compute resources, especially if there are notable differences in communication time for particular processes, which might indicate longer/slower transmission routes or network congestion.
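
As a sketch of the non-blocking overlap suggested above, a blocking exchange can be restructured so that independent computation proceeds while messages are in flight. The neighbour ranks, buffers, and interleaved work are placeholder assumptions:

  #include <mpi.h>

  /* Sketch: overlap a point-to-point exchange with computation that does not
   * depend on the incoming data. */
  void exchange_and_compute(MPI_Comm comm, int left, int right,
                            double *sendbuf, double *recvbuf,
                            double *interior, int n)
  {
      MPI_Request req[2];
      int i;

      MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, comm, &req[0]);
      MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, comm, &req[1]);

      for (i = 0; i < n; i++)        /* work independent of the incoming message */
          interior[i] *= 2.0;

      /* Complete both transfers before touching recvbuf or reusing sendbuf. */
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
  }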
Parent:
MPI Communication Time
Children:
Late Sender Time, Late Receiver Time

Late Sender Time

Description:
Refers to the time lost waiting caused by a blocking receive operation (e.g., MPI_Recv or MPI_Wait) that is posted earlier than the corresponding send operation.


Late Sender Example

If the receiving process is waiting for multiple messages to arrive (e.g., in a call to MPI_Waitall), the maximum waiting time is accounted, i.e., the waiting time due to the latest sender.
Unit:
Seconds
Diagnosis:
Try to post sends earlier, such that they are available when receivers need them. Note that outstanding messages (i.e., sent before the receiver is ready) will occupy internal message buffers.
Parent:
MPI Point-to-point Communication Time
Children:
Late Sender, Wrong Order Time

Late Sender, Wrong Order Time

Description:
A Late Sender situation may be the result of messages being received in the wrong order. If a process expects messages from one or more processes in a certain order, while those processes send them in a different order, the receiver may need to wait for a message because it tries to receive a message early that was sent late.

This pattern comes in two variants; see the descriptions of the corresponding specializations for more details.
Unit:
Seconds
Diagnosis:
Check the proportion of Point-to-point Receive Communications that are Late Sender Instances (Communications). Swap the order of receiving from different sources to match the most common ordering.
Parent:
Late Sender Time
Children:
Late Sender, Wrong Order Time / Different Sources, Late Sender, Wrong Order Time / Same Source

Late Sender, Wrong Order Time / Different Sources

Description:
This specialization of the Late Sender, Wrong Order pattern refers to wrong order situations due to messages received from different source locations.


Messages from different sources Example

Unit:
Seconds
Diagnosis:
Check the proportion of Point-to-point Receive Communications that are Late Sender, Wrong Order Instances (Communications). Swap the order of receiving from different sources to match the most common ordering. Consider using the wildcard MPI_ANY_SOURCE to receive (and process) messages as they arrive from any source rank.
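
A minimal sketch of the MPI_ANY_SOURCE suggestion above; the message size, tag, and buffer layout are assumptions for illustration:

  #include <mpi.h>

  /* Sketch: process messages in whatever order they arrive instead of forcing
   * a fixed source order; status.MPI_SOURCE identifies the actual sender. */
  void receive_in_arrival_order(MPI_Comm comm, int nsenders,
                                double *bufs, int count, int *sources)
  {
      MPI_Status status;
      int i;

      for (i = 0; i < nsenders; i++) {
          MPI_Recv(bufs + i * count, count, MPI_DOUBLE,
                   MPI_ANY_SOURCE, 0, comm, &status);
          sources[i] = status.MPI_SOURCE;   /* record which rank this message came from */
      }
  }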
Parent:
Late Sender, Wrong Order Time
Children:
None

Late Sender, Wrong Order Time / Same Source

Description:
This specialization of the Late Sender, Wrong Order pattern refers to wrong order situations due to messages received from the same source location.


Messages from same source Example

Unit:
Seconds
Diagnosis:
Swap the order of receiving to match the order messages are sent, or swap the order of sending to match the order they are expected to be received. Consider using the wildcard MPI_ANY_TAG to receive (and process) messages in the order they arrive from the source.
Parent:
Late Sender, Wrong Order Time
Children:
None

Late Receiver Time

Description:
A send operation may be blocked until the corresponding receive operation is called, and this pattern refers to the time spent waiting as a result of this situation.


Late Receiver Example

Note that this pattern does not currently apply to nonblocking sends waiting in the corresponding completion call, e.g., MPI_Wait.
Unit:
Seconds
Diagnosis:
Check the proportion of Point-to-point Send Communications that are Late Receiver Instances (Communications). The MPI implementation may be working in synchronous mode by default, such that explicit use of asynchronous nonblocking sends can be tried. If the size of the message to be sent exceeds the available MPI internal buffer space then the operation will be blocked until the data can be transferred to the receiver: some MPI implementations allow larger internal buffers or different thresholds to be specified. Also consider the mapping of processes onto compute resources, especially if there are notable differences in communication time for particular processes, which might indicate longer/slower transmission routes or network congestion.
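
One way to decouple a sender from a late receiver, closely related to the buffering discussion above, is MPI's buffered send mode, where the application supplies the buffer space itself. This is a sketch only; the per-call attach/detach shown is for brevity, and in practice a buffer would be attached once and sized for all outstanding buffered messages:

  #include <mpi.h>
  #include <stdlib.h>

  /* Sketch: a buffered send returns as soon as the message is copied into the
   * attached buffer, regardless of when the receiver posts its receive. */
  void buffered_send(MPI_Comm comm, int dest, double *data, int count)
  {
      void *buf;
      int size;

      MPI_Pack_size(count, MPI_DOUBLE, comm, &size);
      size += MPI_BSEND_OVERHEAD;               /* extra space required per message */
      buf = malloc(size);

      MPI_Buffer_attach(buf, size);
      MPI_Bsend(data, count, MPI_DOUBLE, dest, 0, comm);
      MPI_Buffer_detach(&buf, &size);           /* blocks until buffered messages are sent */

      free(buf);
  }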
Parent:
MPI Point-to-point Communication Time
Children:
None

MPI Collective Communication Time

Description:
This pattern refers to the total time spent in MPI collective communication calls.
Unit:
Seconds
Diagnosis:
MPI implementations generally provide optimized collective communication operations; however, as the number of MPI processes increases (i.e., ranks in MPI_COMM_WORLD or a subcommunicator), time in collective communication can be expected to increase correspondingly. Part of the increase will be due to additional data transmission requirements, which are generally similar for all participants. A significant part is typically the time that some (often many) processes are blocked waiting for the last of the required participants to reach the collective operation. This may be indicated by significant variation in collective communication time across processes, but is most conclusively quantified from the child metrics determinable via automatic trace pattern analysis.

In rare cases, it may be appropriate to replace a collective communication operation provided by the MPI implementation with an alternative implementation of your own using point-to-point operations. For example, certain MPI implementations of MPI_Scan include unnecessary synchronization of all participating processes, or asynchronous variants of collective operations may be preferable to fully synchronous ones.
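
Where the MPI library provides non-blocking (asynchronous) collective operations, introduced with MPI-3, a reduction can be overlapped with computation that does not depend on its result; a minimal sketch with placeholder local work:

  #include <mpi.h>

  /* Sketch: start a global reduction, do independent work, then complete it. */
  void overlap_allreduce(MPI_Comm comm, const double *local, double *global,
                         double *other, int n, int m)
  {
      MPI_Request req;
      int i;

      MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm, &req);

      for (i = 0; i < m; i++)        /* computation that does not need 'global' */
          other[i] += 1.0;

      MPI_Wait(&req, MPI_STATUS_IGNORE);
  }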
Parent:
MPI Communication Time
Children:
Early Reduce Time, Early Scan Time, Late Broadcast Time, Wait at N x N Time, N x N Completion Time

Early Reduce Time

Description:
Collective communication operations that send data from all processes to one destination process (i.e., n-to-1) may suffer from waiting times if the destination process enters the operation earlier than its sending counterparts, that is, before any data could have been sent. The pattern refers to the time lost as a result of this situation. It applies to the MPI calls MPI_Reduce, MPI_Gather and MPI_Gatherv.


Early Reduce Example

Unit:
Seconds
Parent:
MPI Collective Communication Time
Children:
None

Early Scan Time

Description:
MPI_Scan or MPI_Exscan operations may suffer from waiting times if the process with rank n enters the operation earlier than its sending counterparts (i.e., ranks 0..n-1). The pattern refers to the time lost as a result of this situation.


Early Scan Example

Unit:
Seconds
Parent:
MPI Collective Communication Time
Children:
None

Late Broadcast Time

Description:
Collective communication operations that send data from one source process to all processes (i.e., 1-to-n) may suffer from waiting times if destination processes enter the operation earlier than the source process, that is, before any data could have been sent. The pattern refers to the time lost as a result of this situation. It applies to the MPI calls MPI_Bcast, MPI_Scatter and MPI_Scatterv.


Late Broadcast Example

Unit:
Seconds
Parent:
MPI Collective Communication Time
Children:
None

Wait at N x N Time

Description:
Collective communication operations that send data from all processes to all processes (i.e., n-to-n) exhibit an inherent synchronization among all participants, that is, no process can finish the operation until the last process has started it. This pattern covers the time spent in n-to-n operations until all processes have reached it. It applies to the MPI calls MPI_Reduce_scatter, MPI_Reduce_scatter_block, MPI_Allgather, MPI_Allgatherv, MPI_Allreduce and MPI_Alltoall.


Wait at N x N Example

Note that the time reported by this pattern is not necessarily completely waiting time since some processes could – at least theoretically – already communicate with each other while others have not yet entered the operation.
Unit:
Seconds
Parent:
MPI Collective Communication Time
Children:
None

N x N Completion Time

Description:
This pattern refers to the time spent in MPI n-to-n collectives after the first process has left the operation.


N x N Completion Example

Note that the time reported by this pattern is not necessarily completely waiting time since some processes could – at least theoretically – still communicate with each other while others have already finished communicating and exited the operation.
Unit:
Seconds
Parent:
MPI Collective Communication Time
Children:
None

OpenMP Idle threads Time

Description:
Idle time on CPUs that may be reserved for teams of threads when the process is executing sequentially before and after OpenMP parallel regions, or with less than the full team within OpenMP parallel regions.


OMP Example

Unit:
Seconds
Diagnosis:
On shared compute resources, unused threads may simply sleep and allow the resources to be used by other applications; however, on dedicated compute resources (or where unused threads busy-wait and thereby occupy the resources), their idle time is charged to the application. According to Amdahl's Law, the fraction of inherently serial execution time limits the effectiveness of employing additional threads to reduce the execution time of parallel regions. Where the OpenMP Idle threads Time is significant, total Time (and wall-clock execution time) may be reduced by effective parallelization of sections of code which execute serially. Alternatively, the proportion of wasted OpenMP Idle threads Time will be reduced by running with fewer threads, albeit resulting in a longer wall-clock execution time but more effective usage of the allocated compute resources.
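
For reference, Amdahl's Law as invoked above can be written as follows (standard formulation; p is the parallelizable fraction of the execution time and n the number of threads):

  S(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}

so the serial fraction (1 - p) bounds the achievable speedup regardless of the number of threads; threads added beyond that point mainly accumulate idle time.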
Parent:
Time
Children:
OpenMP Limited parallelism Time

OpenMP Limited parallelism Time

Description:
Idle time on CPUs that may be reserved for threads within OpenMP parallel regions where not all of the thread team participates.


OMP Example

Unit:
Seconds
Diagnosis:
Code sections marked as OpenMP parallel regions which are executed serially (i.e., only by the master thread), or by fewer than the full team of threads, can result in allocated but unused compute resources being wasted. Typically this arises from insufficient work being available within the marked parallel region to productively employ all threads. This may be because the loop contains too few iterations or the OpenMP runtime has determined that additional threads would not be productive. Alternatively, the OpenMP omp_set_num_threads API, or num_threads or if clauses, may have been explicitly specified, e.g., to reduce parallel execution overheads such as OpenMP Management Time or OpenMP Synchronization Time. If the proportion of OpenMP Limited parallelism Time is significant, it may be more efficient to run with fewer threads for that problem size.
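
A minimal sketch of restricting the team for a region with little work, using the if and num_threads clauses mentioned above; the threshold of 10000 iterations and the team size of 4 are illustrative assumptions:

  /* Sketch: run serially for small trip counts, and cap the team size when
   * the region cannot productively employ the full team. */
  void scale_small(double *a, int n)
  {
      #pragma omp parallel for if(n > 10000) num_threads(4) schedule(static)
      for (int i = 0; i < n; i++)
          a[i] *= 0.5;
  }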
Parent:
OpenMP Idle threads Time
Children:
None

OpenMP Time

Description:
Time spent in OpenMP API calls and code generated by the OpenMP compiler.
Unit:
Seconds
Parent:
Execution Time
Children:
OpenMP Flush Time, OpenMP Management Time, OpenMP Synchronization Time

OpenMP Flush Time

Description:
Time spent in OpenMP flush directives.
Unit:
Seconds
Parent:
OpenMP Time
Children:
None

OpenMP Management Time

Description:
Time spent managing teams of threads, creating and initializing them when forking a new parallel region and clearing up afterwards when joining.


Management Example

Unit:
Seconds
Diagnosis:
Management overhead for an OpenMP parallel region depends on the number of threads to be employed and the number of variables to be initialized and saved for each thread, each time the parallel region is executed. Typically a pool of threads is used by the OpenMP runtime system to avoid forking and joining threads in each parallel region; however, threads from the pool still need to be added to the team and assigned tasks to perform according to the specified schedule. When the overhead is a significant proportion of the time for executing the parallel region, it is worth investigating whether several parallel regions can be combined to amortize thread management overheads. Alternatively, it may be appropriate to reduce the number of threads either for the entire execution or only for this parallel region (e.g., via num_threads or if clauses).
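
As a sketch of combining parallel regions to amortize the fork/join cost discussed above, a single enclosing parallel region can contain several worksharing loops; the array names and loop bodies are placeholders:

  /* Sketch: one parallel region reusing the same thread team for two loops,
   * instead of opening (and managing) a separate region per loop. */
  void two_loops(double *a, double *b, int n)
  {
      #pragma omp parallel
      {
          #pragma omp for
          for (int i = 0; i < n; i++)
              a[i] = 2.0 * a[i];

          /* the implicit barrier of the first loop ensures a[] is complete */
          #pragma omp for
          for (int i = 0; i < n; i++)
              b[i] = a[i] + 1.0;
      }
  }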
Parent:
OpenMP Time
Children:
OpenMP Management Fork Time

OpenMP Management Fork Time

Description:
Time spent creating and initializing teams of threads.


Fork Example

Unit:
Seconds
Parent:
OpenMP Management Time
Children:
None

OpenMP Synchronization Time

Description:
Time spent in OpenMP synchronization, whether barriers or mutual exclusion via critical sections, atomics or lock API calls.
Unit:
Seconds
Parent:
OpenMP Time
Children:
OpenMP Barrier Synchronization Time, OpenMP Critical Synchronization Time, OpenMP Lock API Synchronization Time

OpenMP Barrier Synchronization Time

Description:
Time spent in implicit (compiler-generated) or explicit (user-specified) OpenMP barrier synchronization. Note that during measurement implicit barriers are treated similarly to explicit ones. The instrumentation procedure replaces an implicit barrier with an explicit barrier enclosed by the parallel construct. This is done by adding a nowait clause and a barrier directive as the last statement of the parallel construct. In cases where the implicit barrier cannot be removed (i.e., at the end of a parallel region), the explicit barrier is executed in front of the implicit barrier, which will then be negligible because the team will already be synchronized when reaching it. The synthetic explicit barrier appears as a special implicit barrier construct.
Unit:
Seconds
Parent:
OpenMP Synchronization Time
Children:
OpenMP Explicit Barrier Synchronization Time, OpenMP Implicit Barrier Synchronization Time

OpenMP Explicit Barrier Synchronization Time

Description:
Time spent in explicit (i.e., user-specified) OpenMP barrier synchronization.
Unit:
Seconds
Diagnosis:
Locate the most costly barrier synchronizations and determine whether they are necessary to ensure correctness or could be safely removed (based on algorithm analysis). Consider replacing an explicit barrier with a potentially more efficient construct, such as a critical section, an atomic, or explicit locks. Examine the time that each thread spends waiting at each explicit barrier, and try to re-distribute preceding work to improve load balance.
Parent:
OpenMP Barrier Synchronization Time
Children:
None

OpenMP Implicit Barrier Synchronization Time

Description:
Time spent in implicit (i.e., compiler-generated) OpenMP barrier synchronization.
Unit:
Seconds
Diagnosis:
Examine the time that each thread spends waiting at each implicit barrier, and if there is a significant imbalance then investigate whether a schedule clause is appropriate. Note that dynamic and guided schedules may require more OpenMP Management Time than static schedules. Consider whether it is possible to employ the nowait clause to reduce the number of implicit barrier synchronizations.
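
A minimal sketch of the schedule and nowait suggestions above, assuming the second loop does not depend on the results of the first (loop bodies are placeholders):

  /* Sketch: dynamic scheduling smooths iterations of varying cost, and nowait
   * removes the implicit barrier between two independent worksharing loops. */
  void independent_loops(double *a, double *b, int n)
  {
      #pragma omp parallel
      {
          #pragma omp for schedule(dynamic, 16) nowait
          for (int i = 0; i < n; i++)
              a[i] = a[i] * a[i] + 1.0;   /* stands in for work of varying cost */

          #pragma omp for schedule(static)
          for (int i = 0; i < n; i++)
              b[i] += 1.0;                /* independent of the first loop */
      }
  }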
Parent:
OpenMP Barrier Synchronization Time
Children:
None

OpenMP Critical Synchronization Time

Description:
Time spent waiting to enter OpenMP critical sections and in atomics, where mutual exclusion restricts access to a single thread at a time.
Unit:
Seconds
Diagnosis:
Locate the most costly critical sections and atomics and determine whether they are necessary to ensure correctness or could be safely removed (based on algorithm analysis).
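
Where a critical section or atomic serves only to accumulate a value, a common alternative (shown here as an illustration, not as advice specific to this tool) is an OpenMP reduction clause, which avoids serializing the threads:

  /* Sketch: per-thread partial sums combined by the runtime instead of a
   * critical section guarding a shared accumulator. */
  double sum_array(const double *a, int n)
  {
      double sum = 0.0;

      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < n; i++)
          sum += a[i];

      return sum;
  }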
Parent:
OpenMP Synchronization Time
Children:
None

OpenMP Lock API Synchronization Time

Description:
Time spent in OpenMP API calls dealing with locks.
Unit:
Seconds
Diagnosis:
Locate the most costly usage of locks and determine whether they are necessary to ensure correctness or could be safely removed (based on algorithm analysis). Consider re-writing the algorithm to use lock-free data structures.
Parent:
OpenMP Synchronization Time
Children:
None

Synchronizations

Description:
This metric provides the total number of MPI synchronization operations that were executed. This includes not only barrier calls, but also communication operations which transfer no data (i.e., zero-sized messages, which are considered to be used for coordination).
Unit:
Counts
Parent:
None
Children:
Point-to-point Synchronizations, Collective Synchronizations

Point-to-point Synchronizations

Description:
Provides the total number of MPI point-to-point synchronization operations, i.e., point-to-point transfers of zero-sized messages used for coordination.
Unit:
Counts
Diagnosis:
Locate the most costly synchronizations and determine whether they are necessary to ensure correctness or could be safely removed (based on algorithm analysis).
Parent:
Synchronizations
Children:
Point-to-point Send Synchronizations, Point-to-point Receive Synchronizations

Point-to-point Send Synchronizations

Description:
Provides the number of MPI point-to-point synchronization operations sending a zero-sized message.
Unit:
Counts
Parent:
Point-to-point Synchronizations
Children:
Late Receiver Instances (Synchronizations)

Point-to-point Receive Synchronizations

Description:
Provides the number of MPI point-to-point synchronization operations receiving a zero-sized message.
Unit:
Counts
Parent:
Point-to-point Synchronizations
Children:
Late Sender Instances (Synchronizations)

Collective Synchronizations

Description:
Provides the number of MPI collective synchronization operations. This includes not only barrier calls, but also calls to collective communication operations that neither send nor receive any data.
Unit:
Counts
Diagnosis:
Locate synchronizations with the largest MPI Collective Synchronization Time and determine whether they are necessary to ensure correctness or could be safely removed (based on algorithm analysis). Collective communication operations that neither send nor receive data, yet are required for synchronization, can be replaced with the more efficient MPI_Barrier.
Parent:
Synchronizations
Children:
None

Communications

Description:
Provides the total number of MPI communication operations, excluding calls transferring no data (which are considered Synchronizations).
Unit:
Counts
Parent:
None
Children:
Point-to-point Communications, Collective Communications

Point-to-point Communications

Description:
Provides the number of MPI point-to-point communication operations, excluding calls transferring zero-sized messages.
Unit:
Counts
Parent:
Communications
Children:
Point-to-point Send Communications, Point-to-point Receive Communications

Point-to-point Send Communications

Description:
Provides the number of MPI point-to-point send operations, excluding calls transferring zero-sized messages.
Unit:
Counts
Parent:
Point-to-point Communications
Children:
Late Receiver Instances (Communications)

Point-to-point Receive Communications

Description:
Provides the number of MPI point-to-point receive operations, excluding calls transferring zero-sized messages.
Unit:
Counts
Parent:
Point-to-point Communications
Children:
Late Sender Instances (Communications)

Collective Communications

Description:
Provides the number of MPI collective communication operations, excluding calls neither sending nor receiving any data.
Unit:
Counts
Parent:
Communications
Children:
Collective Exchange Communications, Collective Communications as Source, Collective Communications as Destination

Collective Exchange Communications

Description:
Provides the number of MPI collective communication operations which are both sending and receiving data.
Unit:
Counts
Parent:
Collective Communications
Children:
None

Collective Communications as Source

Description:
Provides the number of MPI collective communication operations that are only sending but not receiving data.
Unit:
Counts
Parent:
Collective Communications
Children:
None

Collective Communications as Destination

Description:
Provides the number of MPI collective communication operations that are only receiving but not sending data.
Unit:
Counts
Parent:
Collective Communications
Children:
None

Bytes Transferred

Description:
Provides the total number of bytes that were notionally processed in MPI communication operations (i.e., the sum of the bytes that were sent and received). Note that the actual number of bytes transferred is typically not determinable, as this is dependent on the MPI internal implementation, including message transfer and failed delivery recovery protocols.
Unit:
Bytes
Diagnosis:
Expand the metric tree to break down the bytes transferred into constituent classes. Expand the call tree to identify where most data is transferred and examine the distribution of data transferred by each process.
Parent:
None
Children:
Point-to-point Bytes Transferred, Collective Bytes Transferred, Remote Memory Access Bytes Transferred

Point-to-point Bytes Transferred

Description:
Provides the total number of bytes that were notionally processed by MPI point-to-point communication operations.
Unit:
Bytes
Diagnosis:
Expand the calltree to identify where the most data is transferred using point-to-point communication and examine the distribution of data transferred by each process. Compare with the number of Point-to-point Communications and resulting MPI Point-to-point Communication Time.

Average message size can be determined by dividing by the number of MPI Point-to-point Communications (for all call paths or for particular call paths or communication operations). Instead of large numbers of small communications streamed to the same destination, it may be more efficient to pack data into fewer larger messages (e.g., using MPI datatypes). Very large messages may require a rendez-vous between sender and receiver to ensure sufficient transmission and receipt capacity before sending commences: try splitting large messages into smaller ones that can be transferred asynchronously and overlapped with computation. (Some MPI implementations allow tuning of the rendez-vous threshold and/or transmission capacity, e.g., via environment variables.)
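
As a sketch of packing data into fewer, larger messages using MPI datatypes as mentioned above, a strided column of a row-major matrix can be sent as a single message; the matrix layout and column access are assumptions for illustration:

  #include <mpi.h>

  /* Sketch: describe a strided column with a derived datatype and send it in
   * one message, instead of nrows separate single-element sends. */
  void send_column(MPI_Comm comm, int dest, double *matrix,
                   int nrows, int ncols, int col)
  {
      MPI_Datatype column;

      MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &column);
      MPI_Type_commit(&column);

      MPI_Send(matrix + col, 1, column, dest, 0, comm);

      MPI_Type_free(&column);
  }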
Parent:
Bytes Transferred
Children:
Point-to-point Bytes Sent, Point-to-point Bytes Received

Point-to-point Bytes Sent

Description:
Provides the number of bytes that were notionally sent using MPI point-to-point communication operations.
Unit:
Bytes
Diagnosis:
Expand the calltree to see where the most data is sent using point-to-point communication operations and examine the distribution of data sent by each process. Compare with the number of Point-to-point Send Communications and resulting MPI Point-to-point Communication Time.

If the aggregate Point-to-point Bytes Received is less than the amount sent, some messages were cancelled, received into buffers which were too small, or simply not received at all. (Generally only aggregate values can be compared, since sends and receives take place on different callpaths and on different processes.) Sending more data than is received wastes network bandwidth. Applications do not conform to the MPI standard when they do not receive all messages that are sent, and the unreceived messages degrade performance by consuming network bandwidth and/or occupying message buffers. Cancelling send operations is typically expensive, since it usually generates one or more internal messages.
Parent:
Point-to-point Bytes Transferred
Children:
None

Point-to-point Bytes Received

Description:
Provides the number of bytes that were notionally received using MPI point-to-point communication operations.
Unit:
Bytes
Diagnosis:
Expand the calltree to see where the most data is received using point-to-point communication and examine the distribution of data received by each process. Compare with the number of Point-to-point Receive Communications and resulting MPI Point-to-point Communication Time.

If the aggregate Point-to-point Bytes Sent is greater than the amount received, some messages were cancelled, received into buffers which were too small, or simply not received at all. (Generally only aggregate values can be compared, since sends and receives take place on different callpaths and on different processes.) Applications do not conform to the MPI standard when they do not receive all messages that are sent, and the unreceived messages degrade performance by consuming network bandwidth and/or occupying message buffers. Cancelling receive operations may be necessary where speculative asynchronous receives are employed, however, managing the associated requests also involves some overhead.
Parent:
Point-to-point Bytes Transferred
Children:
None

Collective Bytes Transferred

Description:
Provides the total number of bytes that were notionally processed in MPI collective communication operations. This assumes that collective communications are implemented naively using point-to-point communications, e.g., a broadcast being implemented as sends to each member of the communicator (including the root itself). Note that effective MPI implementations use optimized algorithms and/or special hardware, such that the actual number of bytes transferred may be very different.
Unit:
Bytes
Diagnosis:
Expand the calltree to see where the most data is transferred using collective communication and examine the distribution of data transferred by each process. Compare with the number of Collective Communications and resulting MPI Collective Communication Time.
Parent:
Bytes Transferred
Children:
Collective Bytes Outgoing, Collective Bytes Incoming

Collective Bytes Outgoing

Description:
Provides the number of bytes that were notionally sent by MPI collective communication operations.
Unit:
Bytes
Diagnosis:
Expand the calltree to see where the most data is transferred using collective communication and examine the distribution of data outgoing from each process.
Parent:
Collective Bytes Transferred
Children:
None

Collective Bytes Incoming

Description:
Provides the number of bytes that were notionally received by MPI collective communication operations.
Unit:
Bytes
Diagnosis:
Expand the calltree to see where the most data is transferred using collective communication and examine the distribution of data incoming to each process.
Parent:
Collective Bytes Transferred
Children:
None

Remote Memory Access Bytes Transferred

Description:
Provides the total number of bytes that were processed by MPI one-sided communication operations.
Unit:
Bytes
Parent:
Bytes Transferred
Children:
Remote Memory Access Bytes Received, Remote Memory Access Bytes Sent

Remote Memory Access Bytes Received

Description:
Provides the number of bytes that were gotten using MPI one-sided communication operations.
Unit:
Bytes
Parent:
Remote Memory Access Bytes Transferred
Children:
None

Remote Memory Access Bytes Sent

Description:
Provides the number of bytes that were put using MPI one-sided communication operations.
Unit:
Bytes
Parent:
Remote Memory Access Bytes Transferred
Children:
None

Late Sender Instances (Communications)

Description:
Provides the total number of Late Sender instances (see Late Sender Time for details) found in point-to-point communication operations.
Unit:
Counts
Parent:
Point-to-point Receive Communications
Children:
Late Sender, Wrong Order Instances (Communications)

Late Sender, Wrong Order Instances (Communications)

Description:
Provides the total number of Late Sender instances found in point-to-point communication operations where messages were sent in the wrong order (see also Late Sender, Wrong Order Time).
Unit:
Counts
Parent:
Late Sender Instances (Communications)
Children:
None

Late Receiver Instances (Communications)

Description:
Provides the total number of Late Receiver instances (see Late Receiver Time for details) found in point-to-point communication operations.
Unit:
Counts
Parent:
Point-to-point Send Communications
Children:
None

Late Sender Instances (Synchronizations)

Description:
Provides the total number of Late Sender instances (see Late Sender Time for details) found in point-to-point synchronization operations (i.e., zero-sized message transfers).
Unit:
Counts
Parent:
Point-to-point Receive Synchronizations
Children:
Late Sender, Wrong Order Instances (Synchronizations)

Late Sender, Wrong Order Instances (Synchronizations)

Description:
Provides the total number of Late Sender instances found in point-to-point synchronization operations (i.e., zero-sized message transfers) where messages are received in the wrong order (see also Late Sender, Wrong Order Time).
Unit:
Counts
Parent:
Late Sender Instances (Synchronizations)
Children:
None

Late Receiver Instances (Synchronizations)

Description:
Provides the total number of Late Receiver instances (see Late Receiver Time for details) found in point-to-point synchronization operations (i.e., zero-sized message transfers).
Unit:
Counts
Parent:
Point-to-point Send Synchronizations
Children:
None

MPI File Operations

Description:
Number of MPI file operations of any type.
Unit:
Counts
Diagnosis:
Expand the metric tree to see the breakdown of different classes of MPI file operation, expand the calltree to see where they occur, and look at the distribution of operations done by each process.
Parent:
None
Children:
MPI File Individual Operations, MPI File Collective Operations

MPI File Individual Operations

Description:
Number of individual MPI file operations.
Unit:
Counts
Diagnosis:
Examine the distribution of individual MPI file operations done by each process and compare with the resulting exclusive MPI File I/O Time.
Parent:
MPI File Operations
Children:
MPI File Individual Read Operations, MPI File Individual Write Operations

MPI File Individual Read Operations

Description:
Number of individual MPI file read operations.
Unit:
Counts
Diagnosis:
Examine the callpaths where individual MPI file reads occur and the distribution of operations done by each process in them.
Parent:
MPI File Individual Operations
Children:
None

MPI File Individual Write Operations

Description:
Number of individual MPI file write operations.
Unit:
Counts
Diagnosis:
Examine the callpaths where individual MPI file writes occur and the distribution of operations done by each process in them.
Parent:
MPI File Individual Operations
Children:
None

MPI File Collective Operations

Description:
Number of collective MPI file operations.
Unit:
Counts
Diagnosis:
Examine the distribution of collective MPI file operations done by each process and compare with the resulting MPI Collective File I/O Time.
Parent:
MPI File Operations
Children:
MPI File Collective Read Operations, MPI File Collective Write Operations

MPI File Collective Read Operations

Description:
Number of collective MPI file read operations.
Unit:
Counts
Diagnosis:
Examine the callpaths where collective MPI file reads occur and the distribution of operations done by each process in them.
Parent:
MPI File Collective Operations
Children:
None

MPI File Collective Write Operations

Description:
Number of collective MPI file write operations.
Unit:
Counts
Diagnosis:
Examine the callpaths where collective MPI file writes occur and the distribution of operations done by each process in them.
Parent:
MPI File Collective Operations
Children:
None

Remote Memory Access Synchronization Time

Description:
This pattern refers to the time spent in MPI remote memory access synchronization calls.
Unit:
Seconds
Parent:
MPI Synchronization Time
Children:
Late Post Time, Early Wait Time, Wait at Fence Time

Remote Memory Access Communication Time

Description:
This pattern refers to the time spent in MPI remote memory access communication calls, i.e., MPI_Accumulate, MPI_Put, and MPI_Get.
Unit:
Seconds
Parent:
MPI Communication Time
Children:
Early Transfer Time

Late Post Time

Description:
This pattern refers to the time spent in the MPI remote memory access 'Late Post' inefficiency pattern.


Late Post Example

Unit:
Seconds
Parent:
Remote Memory Access Synchronization Time
Children:
None

Early Wait Time

Description:
This pattern refers to idle time in MPI_Win_wait due to an early call to this function, which blocks until all pending completes have arrived.


Early Wait Example

Unit:
Seconds
Parent:
Remote Memory Access Synchronization Time
Children:
Late Complete Time

Late Complete Time

Description:
This pattern refers to the time spent in the 'Early Wait' pattern, due to a late complete call.


Late Complete Example

Unit:
Seconds
Parent:
Early Wait Time
Children:
None

Early Transfer Time

Description:
This pattern refers to the time spent waiting in MPI remote memory access communication routines, i.e., MPI_Accumulate, MPI_Put, and MPI_Get, due to an access before the exposure epoch is opened at the target.


Early Transfer Example

Unit:
Seconds
Parent:
Remote Memory Access Communication Time
Children:
None

Wait at Fence Time

Description:
This pattern refers to the time spent waiting in MPI fence synchronization calls for other participating processes.


Wait at Fence Example

Unit:
Seconds
Parent:
Remote Memory Access Synchronization Time
Children:
Early Fence Time

Early Fence Time

Description:
This pattern refers to the time spent in MPI_Win_fence waiting for outstanding remote memory access operations to this location to finish.


Early Fence Example

Unit:
Seconds
Parent:
Wait at Fence Time
Children:
None

Computational load imbalance heuristic

Description:
This simple heuristic helps to identify computational load imbalances and is calculated for each (call-path, process/thread) pair. Its value represents the absolute difference from the average exclusive execution time. This average value is the aggregated exclusive time spent by all processes/threads in this call path, divided by the number of processes/threads visiting it.
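
Expressed as a formula (the notation is introduced here only to illustrate the description above), the value reported for call path c and process/thread p is

  \mathrm{imbalance}(c,p) = \left|\, t_{\mathrm{excl}}(c,p) - \bar{t}(c) \,\right|,
  \qquad
  \bar{t}(c) = \frac{1}{N_c} \sum_{q=1}^{N_c} t_{\mathrm{excl}}(c,q)

where t_excl(c,q) is the exclusive execution time spent by process/thread q on call path c and N_c is the number of processes/threads visiting that call path.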

Note: A high value for a collapsed call tree node does not necessarily mean that there is a load imbalance in this particular node, but the imbalance can also be somewhere in the subtree underneath.
Unit:
Seconds
Parent:
None
Children:
Computational load imbalance heuristic (values below average), Computational load imbalance heuristic (values above average)

Computational load imbalance heuristic (values below average)

Description:
This metric is provided as a convenience to identify processes/threads where the exclusive execution time spent for a particular call tree node was below the average value.

Please see Computational load imbalance heuristic for details on how this heuristic is calculated.
Unit:
Seconds
Parent:
Computational load imbalance heuristic
Children:
None

Computational load imbalance heuristic (values above average)

Description:
This metric is provided as a convenience to identify processes/threads where the exclusive execution time spent for a particular call tree node was above the average value.

Please see Computational load imbalance heuristic for details on how this heuristic is calculated.
Unit:
Seconds
Parent:
Computational load imbalance heuristic
Children:
None

SCALASCA    Copyright © 1998-2010 Forschungszentrum Jülich