Evaluating the Suitability of Intel oneAPI for Fine-Grained Parallelism
Owen McGrath
Chicago, Illinois
We explored the Intel oneAPI programming framework, specifically focusing on DPC++ and oneTBB, and evaluated how suitable it is for very fine-grained parallelism. We ran benchmarks of oneTBB and compared the results to other parallel programming solutions such as OpenMP and XQueue.
Project status: Under Development
Groups
Student Developers for oneAPI
Intel Technologies
oneAPI, Intel VTune
Overview / Usage
The goal of the project was to become familiar with Intel oneAPI, specifically DPC++ and oneTBB, and then evaluate their suitability for very fine-grained parallelism. An example of a very fine-grained workload is recursive Fibonacci, which creates a task for each step of the computation, and each task may consist of only a few instructions. Some parallel programming libraries perform poorly with tasks this small, because the cost of scheduling a task can outweigh the work it does, and we wanted to determine whether oneTBB falls into this category.
We concluded that DPC++ and oneTBB are both well suited to very fine-grained workloads, and specifically that oneTBB outperformed similar parallel programming libraries in many of our tests. We have written a technical report detailing our tests and findings.
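To give a sense of how small these tasks are, the following is a minimal sketch, not code from the project. It runs recursive Fibonacci serially and counts how many tasks a task-per-call version would create. The names `FibResult` and `fib_counted` are hypothetical. In a real task-parallel version, each recursive call would instead be handed to a scheduler, for example through a oneTBB task group.

```cpp
#include <cstdint>

// Result of the computation plus the number of "tasks" (recursive calls)
// that a task-per-call parallel version would have to schedule.
struct FibResult {
    std::uint64_t value;
    std::uint64_t tasks;
};

// Hypothetical helper: serial recursive Fibonacci that counts its own calls.
// Each call does at most one comparison and one addition of real work,
// which is what makes the workload so fine-grained.
FibResult fib_counted(unsigned n) {
    if (n < 2) {
        return {n, 1};  // base case: one task, no children
    }
    const FibResult a = fib_counted(n - 1);
    const FibResult b = fib_counted(n - 2);
    // This call is itself one task, plus everything its children created.
    return {a.value + b.value, a.tasks + b.tasks + 1};
}
```

For example, `fib_counted(20)` returns the value 6765 after 21891 calls, so a scheduler would have to manage tens of thousands of tasks that each perform a single addition.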
Methodology / Approach
For our testing, we created a series of benchmarking programs. We used them to measure the maximum theoretical throughput in tasks per second, using no-op tasks; to measure the load-balancing capabilities of the scheduler, by tracking how many tasks each logical thread processed during execution; and to compare the performance of oneTBB with that of other parallel programming libraries. Specifically, we looked at the GNU and LLVM implementations of OpenMP, and at XQueue, an OpenMP implementation that uses a completely lock-free scheduler. For this comparison, we converted several OpenMP benchmarking suites to use oneTBB.
We also investigated oneTBB's source code to learn how it achieves its level of performance. This proved difficult, but after the report was written we met with Intel employees, who gave us more details about oneTBB's implementation. These details explained its performance and why our changes did not have the expected effect.
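The two scheduler measurements described above can be sketched as follows. This is an illustrative stand-in built on standard C++ threads and a shared atomic counter. It is not the project's benchmark code and does not use the oneTBB scheduler, and all names (`RunStats`, `run_noop_tasks`, `throughput`, `imbalance`) are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

// Per-run measurements: tasks executed by each worker, and wall-clock time.
struct RunStats {
    std::vector<std::uint64_t> tasks_per_worker;
    double seconds;
};

// Runs `total_tasks` no-op tasks on `workers` threads (assumed >= 1).
// Each worker claims the next task index from a shared atomic counter,
// a simple stand-in for a real task scheduler.
RunStats run_noop_tasks(std::uint64_t total_tasks, unsigned workers) {
    std::vector<std::uint64_t> counts(workers, 0);
    std::atomic<std::uint64_t> next{0};
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&counts, &next, total_tasks, w] {
            std::uint64_t local = 0;
            // Claiming an index is the only work: the task body is empty.
            while (next.fetch_add(1, std::memory_order_relaxed) < total_tasks) {
                ++local;
            }
            counts[w] = local;  // each worker writes only its own slot
        });
    }
    for (auto& t : pool) {
        t.join();
    }

    const auto end = std::chrono::steady_clock::now();
    const double secs = std::chrono::duration<double>(end - start).count();
    return {std::move(counts), secs};
}

// Total number of tasks executed across all workers.
std::uint64_t total_executed(const RunStats& s) {
    return std::accumulate(s.tasks_per_worker.begin(),
                           s.tasks_per_worker.end(), std::uint64_t{0});
}

// Throughput in tasks per second.
double throughput(const RunStats& s) {
    return static_cast<double>(total_executed(s)) / s.seconds;
}

// Load imbalance: busiest worker's count divided by the mean count.
// 1.0 means perfectly even; assumes at least one worker and one task.
double imbalance(const RunStats& s) {
    const double mean = static_cast<double>(total_executed(s)) /
                        static_cast<double>(s.tasks_per_worker.size());
    const double busiest = static_cast<double>(
        *std::max_element(s.tasks_per_worker.begin(), s.tasks_per_worker.end()));
    return busiest / mean;
}
```

Calling `run_noop_tasks(1000000, 8)` and passing the result to `throughput` and `imbalance` gives a tasks-per-second figure and a measure of how evenly the work was spread. Because the tasks are empty, the throughput mostly reflects the cost of handing out tasks, which is the overhead that matters for very fine-grained workloads.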
Technologies Used
As mentioned previously, we used DPC++ and oneTBB. We also used Intel VTune during our investigation of oneTBB's source code. We ran all of our benchmarks on a 192-core, 384-thread, Xeon-powered supercomputing node at our university, which allowed us to run our tests at very high levels of parallelism.