Performance and Portability Evaluation of the K-Means Algorithm on SYCL with CPU-GPU architectures

Youssef

Madrid, Community of Madrid

This work uses the k-means algorithm to assess the performance portability of one of the most advanced implementations in the literature, He-Vialle, across different programming models (DPC++, CUDA, OpenMP) and multi-vendor CPU-GPU architectures.

Project status: Published/In Market

oneAPI, HPC, Artificial Intelligence

Intel Technologies
DevCloud, oneAPI, DPC++, Intel VTune, Intel CPU, Other, XeSS

Overview / Usage

SYCL emerged to exploit the ever-increasing number of accelerators in HPC while keeping code portable. Among other offerings, Intel oneAPI provides a toolkit that includes Intel's SYCL implementation, Data Parallel C++ (DPC++), specialised libraries, and the SYCLomatic tool, which ports code from CUDA to DPC++.

To test SYCL's performance portability, this paper uses the k-means algorithm as a case study, because its code is simple but its optimization is complex. As a starting point, we considered the most efficient implementation for Nvidia GPUs in CUDA and for multi-core CPUs in OpenMP. Our resulting SYCL code, in turn, can potentially run on any GPU or CPU. Additionally, we added a hand-tuned SYCL variant for each device architecture (CPU, Nvidia GPU and Intel GPU) to evaluate the performance gap between a standard version and an optimised one that takes advantage of the target device's features.
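For readers unfamiliar with the algorithm, the following is a minimal sequential sketch of one k-means (Lloyd) iteration in plain C++. It is not the He-Vialle code nor the SYCL port from this project; the function name `kmeans_step`, the row-major data layout and the choice to leave empty clusters unchanged are assumptions made for illustration only. It shows the two phases that a parallel version has to map onto a device: the per-point assignment, which is independent across points, and the centroid update, which needs a reduction over all points.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// One Lloyd iteration of k-means over row-major data (illustrative sketch).
//   points:    n * dim values
//   centroids: k * dim values, updated in place
//   labels:    n entries, overwritten with the index of the nearest centroid
// Returns the number of points whose label changed in this iteration.
std::size_t kmeans_step(const std::vector<float>& points,
                        std::vector<float>& centroids,
                        std::vector<int>& labels,
                        std::size_t dim) {
    assert(dim > 0 && points.size() % dim == 0 && centroids.size() % dim == 0);
    const std::size_t n = points.size() / dim;
    const std::size_t k = centroids.size() / dim;
    assert(labels.size() == n);

    std::vector<float> sums(k * dim, 0.0f);   // per-cluster coordinate sums
    std::vector<std::size_t> counts(k, 0);    // per-cluster point counts
    std::size_t changed = 0;

    // Assignment step: each point picks the centroid with the smallest
    // squared Euclidean distance. Points are independent of each other.
    for (std::size_t i = 0; i < n; ++i) {
        float best = std::numeric_limits<float>::max();
        int best_c = 0;
        for (std::size_t c = 0; c < k; ++c) {
            float d2 = 0.0f;
            for (std::size_t j = 0; j < dim; ++j) {
                const float diff = points[i * dim + j] - centroids[c * dim + j];
                d2 += diff * diff;
            }
            if (d2 < best) {
                best = d2;
                best_c = static_cast<int>(c);
            }
        }
        if (labels[i] != best_c) {
            labels[i] = best_c;
            ++changed;
        }
        // Accumulate for the update step (a reduction in a parallel version).
        ++counts[best_c];
        for (std::size_t j = 0; j < dim; ++j) {
            sums[best_c * dim + j] += points[i * dim + j];
        }
    }

    // Update step: each centroid becomes the mean of its assigned points.
    // Empty clusters keep their previous centroid (an assumption of this sketch).
    for (std::size_t c = 0; c < k; ++c) {
        if (counts[c] == 0) continue;
        for (std::size_t j = 0; j < dim; ++j) {
            centroids[c * dim + j] = sums[c * dim + j] / static_cast<float>(counts[c]);
        }
    }
    return changed;
}
```

A caller would typically repeat `kmeans_step` until it returns 0 or an iteration limit is reached. The tuning effort in a device version usually concentrates on how the accumulation into `sums` and `counts` is parallelised, since many points write to the same cluster.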

The tests show that the SYCL code outperforms the original version on Intel GPUs and CPUs, whereas on Nvidia GPUs it is quite inefficient due to issues found in the compiler.

Methodology / Approach

First, we surveyed the literature and found the He-Vialle implementation of the k-means algorithm, which was developed in CUDA for Nvidia GPUs and in OpenMP for CPUs. We then used the SYCLomatic oneAPI tool to port the CUDA code to DPC++.

After that, we tested the resulting code on an Nvidia GPU (1050 Ti) using the Intel LLVM open-source compiler. However, running the ported code on an Intel CPU (Intel Xeon Platinum 8358) and GPU (Iris Xe MAX DG1) resulted in poor performance. To solve that, we used Intel VTune to identify the main bottlenecks on both architectures, and we finally adapted the He-Vialle OpenMP implementation to DPC++ to run on CPUs. For the Iris Xe MAX DG1, the CPU code was then adapted to take advantage of the Intel Iris Xe GPU architecture.

Technologies Used

  • SYCLomatic
  • DPC++ compiler
  • Intel VTune
  • Intel Advisor

Repository

https://github.com/artecs-group/k-means
