Distributed k-Nearest Neighbors using Locality-Sensitive Hashing and SYCL

Marcel Breyer

Stuttgart, Baden-Württemberg

In the age of AI, algorithms must efficiently cope with vast data sets. We propose a performance-portable implementation of Locality-Sensitive Hashing (LSH), an approximate k-nearest neighbors algorithm, using different SYCL implementations—ComputeCpp, hipSYCL, DPC++—supporting multiple GPUs.

Project status: Published/In Market

oneAPI, HPC

Intel Technologies
DevCloud, DPC++, Intel Iris Xe MAX

Overview / Usage

In the age of data collection, machine learning algorithms have to cope efficiently with vast data sets. This requires scalable algorithms and efficient implementations that can handle heterogeneous hardware. We propose a new, performance-portable implementation of a well-known, robust, and versatile multi-class classification method that supports multiple Graphics Processing Units (GPUs) from different vendors. It is based on a performance-portable implementation of the approximate k-nearest neighbors (k-NN) algorithm in SYCL. The k-NN algorithm assigns a class to a data point based on a majority vote of its neighborhood. The naive approach compares a data point x to all other data points in the training data to identify the k nearest ones. However, this has quadratic runtime and is infeasible for large data sets. Therefore, approximate variants have been developed. One such algorithm is Locality-Sensitive Hashing (LSH), which uses hash tables together with locality-sensitive hash functions to reduce the number of data points that have to be examined to compute the k-NN.
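For reference, the following is a minimal sketch of the naive exact k-NN search described above. It is illustrative C++ only; the function name and signature are hypothetical and not part of our library.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Naive exact k-NN: compare the query against every training point. Doing
// this for every data point yields the quadratic overall runtime mentioned
// above. Assumes k <= data.size().
std::vector<std::size_t> knn_brute_force(const std::vector<std::vector<double>>& data,
                                         const std::vector<double>& query,
                                         const std::size_t k) {
    std::vector<std::pair<double, std::size_t>> dist(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        double d = 0.0;
        for (std::size_t j = 0; j < query.size(); ++j) {
            const double diff = data[i][j] - query[j];
            d += diff * diff;  // squared Euclidean distance suffices for ranking
        }
        dist[i] = { d, i };
    }
    // Only the k smallest distances are needed, not a fully sorted list.
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
    std::vector<std::size_t> neighbors(k);
    for (std::size_t i = 0; i < k; ++i) { neighbors[i] = dist[i].second; }
    return neighbors;
}
```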

To the best of our knowledge, no distributed LSH implementation supporting multiple GPUs from different vendors has been available so far, despite the fact that k-NN is frequently employed. Therefore, we have developed the sycl_lsh library. It provides the first hardware-independent, yet efficient and distributed implementation of the LSH algorithm that is suited for modern supercomputers. The implementation uses C++17 together with SYCL 1.2.1, an abstraction layer for OpenCL that allows targeting different hardware with a single implementation. To support large data sets, we utilize multiple GPUs via the Message Passing Interface (MPI), enabling the use of both shared and distributed memory systems.
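To illustrate the single-source model that makes this portability possible, here is a minimal, generic SYCL 1.2.1 example (a sketch for illustration, not code from sycl_lsh). The same source can be compiled with DPC++, ComputeCpp, or hipSYCL and executed on whatever device the runtime selects.

```cpp
#include <CL/sycl.hpp>
#include <vector>

namespace sycl = cl::sycl;  // SYCL 1.2.1 uses the cl::sycl namespace

int main() {
    std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024, 0.0f);
    {
        // the default selector picks a device provided by the SYCL implementation
        sycl::queue queue{ sycl::default_selector{} };
        sycl::buffer<float, 1> buf_a{ a.data(), sycl::range<1>{ a.size() } };
        sycl::buffer<float, 1> buf_b{ b.data(), sycl::range<1>{ b.size() } };
        sycl::buffer<float, 1> buf_c{ c.data(), sycl::range<1>{ c.size() } };
        queue.submit([&](sycl::handler& cgh) {
            auto acc_a = buf_a.get_access<sycl::access::mode::read>(cgh);
            auto acc_b = buf_b.get_access<sycl::access::mode::read>(cgh);
            auto acc_c = buf_c.get_access<sycl::access::mode::discard_write>(cgh);
            cgh.parallel_for<class vector_add>(sycl::range<1>{ a.size() },
                [=](sycl::id<1> idx) { acc_c[idx] = acc_a[idx] + acc_b[idx]; });
        });
    }  // buffer destruction copies the result back into the host vector c
    return 0;
}
```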

We have tested different parameter combinations for the two locality-sensitive hash function implementations and compared them. Our results show that our library scales well to multiple GPUs with both hash function types, achieving a nearly optimal parallel speedup of up to 7.6 on 8 GPUs. Furthermore, we demonstrate that the library supports different SYCL implementations—ComputeCpp, hipSYCL, and DPC++—to target different hardware architectures without significant performance differences.

Methodology / Approach

Locality-Sensitive Hashing (LSH) can be used to compute the approximate k-nearest neighbors. Normally, hash functions aim to minimize the number of collisions when hashing: they ensure for two points x and y that the probability of having the same hash value h(x) = h(y) is small. In contrast, locality-sensitive hash functions aim to maximize the probability of hash collisions if two points x and y are close or similar to each other with respect to a chosen distance or similarity metric.
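This intuition is commonly formalized via (r_1, r_2, p_1, p_2)-sensitive hash families; the following is the standard textbook definition, restated here for clarity rather than taken from the project:

```latex
% A family H is (r_1, r_2, p_1, p_2)-sensitive for a metric d, with
% r_1 < r_2 and p_1 > p_2, if for all points x and y:
\begin{align*}
  d(x, y) \le r_1 &\implies \Pr_{h \in H}\bigl[h(x) = h(y)\bigr] \ge p_1,\\
  d(x, y) \ge r_2 &\implies \Pr_{h \in H}\bigl[h(x) = h(y)\bigr] \le p_2.
\end{align*}
```

Close points thus collide with probability at least p_1, while distant points collide with probability at most p_2.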

Two such locality-sensitive hash functions are implemented in our sycl_lsh library.

The Random Projections divide the number line into segments of the same size. The hash value of a point x is the id of the segment onto which x is projected. The advantages are that the hash functions are cheap to create and h(x) is efficient to compute. However, the data points are not uniformly distributed over the segments, resulting in potential load imbalances.
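A plausible sketch of a single such hash function (hypothetical code, not the sycl_lsh implementation): project x onto a random direction, add a random offset, and return the index of the equally sized segment the projection falls into.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// One random-projection hash function: h(x) = floor((<w, x> + b) / segment_size).
// All names are illustrative, not the sycl_lsh API.
struct random_projection_hash {
    std::vector<float> w;  // random projection direction (Gaussian entries)
    float b;               // random offset in [0, segment_size)
    float segment_size;    // width of each segment on the number line

    std::int32_t operator()(const std::vector<float>& x) const {
        const float dot = std::inner_product(x.begin(), x.end(), w.begin(), 0.0f);
        return static_cast<std::int32_t>(std::floor((dot + b) / segment_size));
    }
};

// Creating a hash function only requires drawing dims + 1 random numbers,
// which is why these hash functions are cheap to construct.
random_projection_hash make_hash(const std::size_t dims, const float segment_size,
                                 std::mt19937& gen) {
    std::normal_distribution<float> gauss{ 0.0f, 1.0f };
    std::uniform_real_distribution<float> offset{ 0.0f, segment_size };
    random_projection_hash h{ std::vector<float>(dims), offset(gen), segment_size };
    for (auto& w_i : h.w) { w_i = gauss(gen); }
    return h;
}
```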

The Entropy-Based hash functions divide the number line into segments such that each segment contains the same number of data points. Using these hash functions, h(x) can also be computed efficiently, and the data points are, by construction, uniformly distributed over the segments. However, the entropy-based hash functions are expensive to create.
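The expensive part is the construction of the segment boundaries, which amounts to taking quantiles of the projected training data. A sketch under that assumption (illustrative code, not the library's implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Choose segment boundaries as quantiles of the projected training data so
// that every segment receives the same number of points. The sort over all
// projections is what makes these hash functions expensive to create.
std::vector<float> equal_frequency_boundaries(std::vector<float> projections,
                                              const std::size_t num_segments) {
    std::sort(projections.begin(), projections.end());
    std::vector<float> boundaries;
    for (std::size_t s = 1; s < num_segments; ++s) {
        boundaries.push_back(projections[s * projections.size() / num_segments]);
    }
    // Evaluating h(x) later stays cheap: a binary search for the projection of x, e.g.
    // std::upper_bound(boundaries.begin(), boundaries.end(), proj) - boundaries.begin()
    return boundaries;
}
```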

Our implementation can utilize multiple GPUs via the Message Passing Interface (MPI) and can therefore also be used in a distributed system setting. It also supports multiple SYCL implementations: DPC++, ComputeCpp, and hipSYCL. Two different evaluation metrics are implemented: the recall (the number of correctly calculated nearest neighbors divided by k) and the error ratio (the relative error in the distances between the correct and the calculated nearest neighbors). The second metric is better suited for an approximate k-nearest neighbors algorithm, since calculated nearest neighbors that are not the exact nearest neighbors but very close to them may be good enough.
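Stated as code, the two metrics could look as follows (hypothetical helper functions matching the definitions above, not the sycl_lsh API):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Recall: number of correctly calculated nearest neighbors divided by k.
double recall(const std::vector<std::size_t>& calculated,
              const std::vector<std::size_t>& exact) {
    std::size_t hits = 0;
    for (const std::size_t id : calculated) {
        if (std::find(exact.begin(), exact.end(), id) != exact.end()) { ++hits; }
    }
    return static_cast<double>(hits) / exact.size();
}

// Error ratio: average over the k neighbors of d(x, calculated_i) / d(x, exact_i).
// A value close to 1 means the calculated neighbors are almost as close to x
// as the exact ones, even if they are not identical.
double error_ratio(const std::vector<double>& calculated_dists,
                   const std::vector<double>& exact_dists) {
    double sum = 0.0;
    for (std::size_t i = 0; i < exact_dists.size(); ++i) {
        sum += calculated_dists[i] / exact_dists[i];
    }
    return sum / exact_dists.size();
}
```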

We compared our implementation using the three different SYCL implementations on systems with an Intel, an NVIDIA, and an AMD GPU, respectively, as well as on a system with 8 NVIDIA GTX 1080 Ti GPUs. A few notable results:

  • Random Projections:
    • ComputeCpp has slightly higher overheads for small recall values (i.e., smaller runtimes) on the NVIDIA RTX 3080 compared to DPC++ and hipSYCL
    • hipSYCL on the AMD GPU also has higher overheads for small recall values
    • the performance of hipSYCL on the Intel GPU is considerably worse than that of DPC++ and ComputeCpp due to its highly experimental implementation status (e.g., no support for local memory)
  • Entropy-Based Hash Functions:
    • ComputeCpp again has slightly higher overheads for small recall values on the NVIDIA RTX 3080 compared to DPC++ and hipSYCL
    • DPC++ and ComputeCpp have high overheads on the NVIDIA GTX 1080 Ti GPU but not on the RTX 3080 GPU (caused by the creation of the hash functions)
    • hipSYCL's performance on the Intel GPU is again significantly worse than that of DPC++ and ComputeCpp (again due to the highly experimental implementation status)

In general, the entropy-based hash functions are faster for high recall values than the random projections (and use less memory, since fewer hash tables are needed).
Additionally, we observe overall similar runtime characteristics for all three SYCL implementations.

Speedup on up to 8 GPUs:

  • Friedman data set (500'000 x 10):
    • scaling is generally worse compared to the larger HIGGS data set
    • DPC++ and hipSYCL scale better than ComputeCpp
    • using the entropy-based hash functions, DPC++ and ComputeCpp do not scale at all (if the runtime for the creation of the hash functions is neglected, both SYCL implementations do scale)
  • HIGGS data set (1'000'000 x 27):
    • DPC++ achieves, for both locality-sensitive hash function types, a near-optimal speedup of up to 7.6 on 8 GPUs, reducing the runtime from 37.6 s to 4.9 s
    • hipSYCL scales slightly worse than DPC++, while ComputeCpp scales worse than hipSYCL

Special thanks are due to Intel for providing access to the DevCloud, as well as to M.Sc. Gregor Daiß and Prof. Dr. Dirk Pflüger from the University of Stuttgart.

Technologies Used

Hardware and Software:

  • Intel i9-10920X @ 3.5 GHz + Intel Iris Xe MAX
    • DPC++ (Intel LLVM fork)
    • ComputeCpp (v2.5.0)
    • hipSYCL (v0.9.1)
  • Intel i9-10980XE @ 3.0 GHz + NVIDIA RTX 3080
    • DPC++ (Intel LLVM fork)
    • ComputeCpp (v2.5.0)
    • hipSYCL (v0.9.1)
  • AMD EPYC 7551P @ 2.0 GHz + AMD Radeon VII (VEGA 20)
    • hipSYCL (v0.9.1)
  • Intel Xeon Gold 5120 @ 2.2 GHz + 8x NVIDIA GTX 1080 Ti
    • DPC++ (Intel LLVM fork)
    • ComputeCpp (v2.5.0)
    • hipSYCL (v0.9.1)


Repository

https://github.com/SC-SGS/Distributed_GPU_LSH_using_SYCL
