TAU Performance System

Sameer Shende

Eugene, Oregon

Project status: Published/In Market

oneAPI, HPC

Intel Technologies
oneAPI, DPC++, Intel Integrated Graphics, DevCloud

Overview / Usage

The TAU Performance System® supports profiling and tracing of programs written using Intel oneAPI. oneAPI provides two programming interfaces, OpenCL and DPC++/SYCL, for CPUs, GPUs, and other devices. With TAU, a user can observe the performance of a program at both the CPU and the GPU level. At the GPU level, TAU supports the OpenCL profiling interface as well as the Intel Level Zero API. With these two interfaces, it is possible to track the precise timings of kernels executing on the GPU and to observe the data transfers between the host and the accelerator device. TAU has been tested on Intel Gen12 GPUs, including the TigerLake platform, and on Gen9 GPUs (such as the Iris system at the ALCF and the Intel DevCloud) using the Intel oneAPI Base Toolkit and HPC Toolkit software stacks.

Methodology / Approach

Instrumenting and measuring an application’s performance is the first step towards optimizing it. Typically, the process of instrumenting the application with source code and build system modifications is viewed as cumbersome, but it does not have to be. Tools such as TAU [http://tau.uoregon.edu] can operate on an unmodified binary to generate detailed summary statistics (profiles) and event traces. With TAU, application performance can be observed at the statement, loop, and function level, and broken down per MPI rank, per thread, and even per GPU kernel. TAU is a versatile profiling and tracing toolkit that has been ported to a wide range of HPC platforms.

In its latest release, v2.30, TAU supports GPUs from Intel® using Level Zero (for Gen12LP GPUs). It also supports OpenACC, OpenCL, and the OpenMP Tools (OMPT) interface that is now part of the OpenMP 5.0 standard. These interfaces allow TAU to transparently intercept runtime system calls, measure the performance of key code regions using timer calls, and track the arguments that flow through the runtime library functions. To use TAU, the user launches the application with the tau_exec tool, passing optional command-line parameters to enable the relevant runtimes (e.g., -l0 for Level Zero instrumentation or -opencl for OpenCL on Intel GPUs).

Profiles show aggregate statistics based on exclusive and inclusive durations and on the samples collected, whereas traces offer a temporal view of the performance data, typically in a timeline display where each process is shown along a time axis. After the experiment has concluded, TAU profiles can be viewed using pprof or using the GUI, paraprof. TAU’s paraprof browser includes a 3D profile window in which code regions, MPI ranks (and threads), and the exclusive time spent in those code regions are shown together in a 3D plot that can be examined interactively. TAU can also generate OTF2 trace files natively. These files can be visualized in the Vampir trace visualization tool from TU Dresden.
This provides a powerful mechanism to generate low-level trace data about kernel execution along a timeline for the application.
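
The launch step described above can be sketched as a small Python helper that assembles the tau_exec command line. This is only an illustrative sketch: the helper names (build_tau_command, run_under_tau) and the runtime labels in RUNTIME_FLAGS are hypothetical, and the only tau_exec options used are the two named in the text, -l0 and -opencl. On the command line, the equivalent would simply be tau_exec -l0 followed by the application and its arguments.

```python
import shlex
import shutil
import subprocess

# Only the two runtime options named in the text are covered here.
# The dictionary keys are illustrative labels, not TAU terminology.
RUNTIME_FLAGS = {"level_zero": "-l0", "opencl": "-opencl"}


def build_tau_command(app_argv, runtimes=("level_zero",)):
    """Return the argv list that launches app_argv under tau_exec."""
    flags = [RUNTIME_FLAGS[name] for name in runtimes]
    return ["tau_exec", *flags, *app_argv]


def run_under_tau(app_argv, runtimes=("level_zero",)):
    """Run the unmodified binary under tau_exec.

    Returns the exit code, or None when tau_exec is not on PATH.
    """
    command = build_tau_command(app_argv, runtimes)
    if shutil.which(command[0]) is None:
        print("tau_exec not found on PATH; would have run:", shlex.join(command))
        return None
    return subprocess.run(command, check=False).returncode
```

For example, run_under_tau(["./my_sycl_app"]) would launch a placeholder binary named my_sycl_app with Level Zero instrumentation enabled; the resulting profiles could then be inspected with pprof or paraprof as described above.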

Technologies Used

This project uses Intel oneAPI, DPC++/SYCL, OpenCL, and Intel compilers and runtime libraries.

Performance evaluation tools that can expose the inner workings of the runtime system and slice through layers of the runtime while sampling the rest of the application codebase provide a hybrid view of performance data. TAU has recently been extended to work with Intel® oneAPI GPU runtimes such as Level Zero, as well as portable runtimes such as OpenACC, Kokkos, OpenMP, and MPI. TAU can generate both profiles and traces, and its performance data visualization tools can help shed light on application bottlenecks on both CPUs and GPUs. Because it needs no modifications to the application binary, TAU is easy to deploy when investigating application performance. TAU is available for download from the TAU webpage and is part of the Extreme-scale Scientific Software Stack (E4S), a curated Spack-based release of HPC and AI/ML packages for containerized deployment and bare-metal builds. TAU and E4S are supported by the US DOE Exascale Computing Project (ECP).

Acknowledgement

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.

Repository

https://github.com/UO-OACISS/tau2
