The Great CEED Bake-off: DPC++ Edition

Kris Rowe

Lemont, Illinois

The CEED Bake-off Problems are a collection of benchmarks representing compute-intensive kernels relevant to spectral element methods, such as those used in the Nek5000 CFD code. A DPC++ implementation of the benchmarks will be used to set performance baselines for CEED applications on Intel CPUs and GPUs.

Project status: Under Development

oneAPI, HPC

Intel Technologies
oneAPI, DPC++, Intel Iris Xe, Intel Iris Xe MAX, Intel VTune, Intel oneMKL, Intel CPU


Overview / Usage

The Center for Efficient Exascale Discretizations (CEED) is a co-design effort within the U.S. Department of Energy's Exascale Computing Project (ECP) focused on high-order finite element and spectral element methods. A collection of benchmarks—known as the CEED Bake-off Problems—represent important compute-intensive kernels and solvers relevant to computational science and engineering applications such as Nek5000, MFEM, and libParanumal.

A DPC++ implementation of the CEED Bake-off Problems—designed and developed by scientists at Argonne National Laboratory—provides a tool to establish performance baselines for CEED applications on Intel CPUs and GPUs. Additionally, it provides a sandbox to explore the replacement of directly programmed DPC++ kernels with batched routines available through the Intel oneMKL BLAS-like extensions.
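
To make the batched-routine idea concrete, the sketch below (hypothetical names and sizes, not the project's actual code) shows how a batch of small per-element matrix products could be dispatched through the oneMKL DPC++ strided `gemm_batch` interface.

```cpp
#include <sycl/sycl.hpp>
#include <oneapi/mkl/blas.hpp>
#include <cstdint>

// Hypothetical sketch: one gemm_batch call computes C_e = A_e * B_e for every
// element e, where each operand is a small p x p block stored contiguously in
// USM device memory. In spectral element kernels the left operand is typically
// a shared interpolation or derivative matrix, which could be replicated per
// element (or, depending on the library version, reused via the stride argument).
sycl::event apply_operator_batched(sycl::queue &q, std::int64_t p,
                                   std::int64_t n_elem,
                                   const double *A,  // p x p x n_elem
                                   const double *B,  // p x p x n_elem
                                   double *C) {      // p x p x n_elem
  namespace blas = oneapi::mkl::blas::column_major;
  const std::int64_t stride = p * p;
  return blas::gemm_batch(q,
                          oneapi::mkl::transpose::nontrans,
                          oneapi::mkl::transpose::nontrans,
                          p, p, p,
                          1.0, A, p, stride,
                          B, p, stride,
                          0.0, C, p, stride,
                          n_elem);
}
```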

The lessons learned and information gained through this project will be invaluable in efforts to prepare for the upcoming Aurora supercomputer.

Methodology / Approach

All benchmarks share a core set of utility functions: computing Gaussian quadrature nodes and weights, building 1D interpolation and derivative matrices, constructing 2D/3D meshes, verifying kernel accuracy, and collecting timing statistics.
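
Purely for illustration, the shared utilities might expose an interface along the following lines; all names and signatures here are hypothetical, not the project's actual API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical interface sketch for the shared benchmark utilities.

// Gaussian quadrature nodes and weights on [-1, 1] for a given polynomial order.
void quadrature_nodes_and_weights(int order, std::vector<double> &nodes,
                                  std::vector<double> &weights);

// Dense 1D interpolation and derivative matrices associated with those nodes.
std::vector<double> interpolation_matrix(int from_order, int to_order);
std::vector<double> derivative_matrix(int order);

// Simple structured 2D/3D mesh construction.
struct Mesh;
Mesh build_mesh(int nx, int ny, int nz);

// Accuracy verification against a host reference, and timing statistics.
double max_abs_error(const double *result, const double *reference, std::size_t n);
struct TimingStats { double mean, min, max, variance; };
TimingStats summarize(const std::vector<double> &times_ms);
```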

Memory

DPC++ unified shared memory (USM) is used, and all memory is allocated on the device. Host-to-device memcpy calls occur before the kernel launches and are not included in the timings. DPC++ events are used to explicitly manage memory dependencies between kernels.
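
A minimal sketch of this memory pattern, assuming a recent SYCL 2020-conformant DPC++ compiler (variable names are illustrative):

```cpp
#include <sycl/sycl.hpp>   // <CL/sycl.hpp> on older oneAPI releases
#include <vector>

int main() {
  sycl::queue q{sycl::gpu_selector_v,
                sycl::property::queue::enable_profiling{}};

  const size_t n = 1 << 20;
  std::vector<double> x_host(n, 1.0);

  // All working memory lives on the device (USM device allocations).
  double *x = sycl::malloc_device<double>(n, q);
  double *y = sycl::malloc_device<double>(n, q);

  // Host-to-device copies happen before the kernels and are excluded from timings.
  sycl::event copy_done = q.memcpy(x, x_host.data(), n * sizeof(double));

  // Events make the dependency between the copy and the kernel explicit.
  sycl::event kernel_done = q.submit([&](sycl::handler &h) {
    h.depends_on(copy_done);
    h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { y[i] = 2.0 * x[i]; });
  });
  kernel_done.wait();

  sycl::free(x, q);
  sycl::free(y, q);
}
```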

Kernels

Two different types of kernels are used: directly programmed and oneMKL-based. Directly programmed DPC++ kernels are defined using lambdas wrapped in a C++ function called from the host. Each kernel is called once for accuracy verification before any measurement, which also triggers JIT compilation so that compilation time is not included in the timings. Kernels using oneMKL wrap one or more oneMKL calls inside a C++ function called from the host.
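
As a hedged illustration of the directly programmed style (the kernel body and function name are hypothetical, not one of the actual bake-off kernels):

```cpp
#include <sycl/sycl.hpp>
#include <vector>

// Hypothetical example of the pattern: the DPC++ kernel is a lambda inside an
// ordinary C++ function called from the host; the returned event lets callers
// express dependencies between kernels explicitly.
sycl::event scaled_add(sycl::queue &q, size_t n, double alpha,
                       const double *x, double *y,
                       const std::vector<sycl::event> &deps = {}) {
  return q.submit([&](sycl::handler &h) {
    h.depends_on(deps);
    h.parallel_for(sycl::range<1>(n),
                   [=](sycl::id<1> i) { y[i] += alpha * x[i]; });
  });
}
```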

Accuracy Verification

A serial C++ version of each kernel is used to calculate a reference solution on the host.
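
A small sketch of such a check, assuming the device results have already been copied back to the host (tolerance and names are illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical verification helper: compare results copied back from the device
// against the serial host reference using the maximum absolute error.
bool verify(const double *device_result, const double *host_reference,
            std::size_t n, double tol = 1e-12) {
  double max_err = 0.0;
  for (std::size_t i = 0; i < n; ++i)
    max_err = std::max(max_err, std::fabs(device_result[i] - host_reference[i]));
  return max_err <= tol;
}
```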

Measurement

Each kernel is run multiple times, and DPC++ event profiling is used to measure execution time. Metrics such as the mean, minimum/maximum, and variance of the execution time are then calculated. Additionally, the runtime is recorded on the host using `std::chrono` for comparison.
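
A sketch of how a single run might be timed, assuming the queue was constructed with profiling enabled and reusing the hypothetical `scaled_add` wrapper sketched above:

```cpp
#include <sycl/sycl.hpp>
#include <chrono>

// Hypothetical timing helper: device execution time from DPC++ event profiling,
// host wall-clock time from std::chrono for comparison.
double time_scaled_add_ms(sycl::queue &q, size_t n, double alpha,
                          const double *x, double *y, double &host_ms) {
  auto h0 = std::chrono::steady_clock::now();
  sycl::event e = scaled_add(q, n, alpha, x, y);
  e.wait();
  auto h1 = std::chrono::steady_clock::now();
  host_ms = std::chrono::duration<double, std::milli>(h1 - h0).count();

  // Profiling timestamps are reported in nanoseconds.
  auto t0 = e.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto t1 = e.get_profiling_info<sycl::info::event_profiling::command_end>();
  return (t1 - t0) * 1e-6;  // device time in milliseconds
}
```

Repeating this over many runs yields the samples from which the mean, minimum/maximum, and variance are computed.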

Technologies Used

Software

  • Intel oneAPI Base Toolkit
    • Intel oneAPI DPC++ Compiler
    • Intel oneAPI Math Kernel Library (DPC++ interface)
    • Intel Distribution for GDB
    • Intel Advisor
    • Intel VTune Profiler
  • CMake
  • Visual Studio Code

Hardware

  • Intel Iris Xe (Gen9) GPU
  • Intel Iris Xe MAX GPU
  • Intel Xe-HP GPU
  • Argonne National Laboratory's JLSE Testbeds

Acknowledgments

This work was supported by the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357, and by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy’s Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation’s exascale computing imperative.
