eXTreme fine-grAined concurrent taSKing framework

XTASK enables extreme fine-grained parallelism across modern many-core architectures with hundreds of cores by implementing a novel lock-less, multiple-producer, multiple-consumer, out-of-order queuing mechanism for managing parallel tasks.

Project status: Under Development

oneAPI, HPC

Groups
Student Developers for oneAPI

Intel Technologies
oneAPI

Overview / Usage

Supporting fine-grained task parallelism is a significant challenge on hardware platforms with rapidly increasing core counts. Concurrent data structures rely on hardware primitives to synchronize access to shared memory across many cores and threads of execution, and existing synchronization mechanisms do not scale at high concurrency on modern architectures. Any parallel runtime that aims to support fine-grained parallelism across many threads must therefore be wary of traditional synchronization techniques. Concurrent queues, a crucial building block of parallel runtime systems, are typically built on exactly these mechanisms. To reduce synchronization overhead in parallel runtime systems and to shrink the task granularity they can support efficiently, we propose XQueue, a novel lock-less, multiple-producer, multiple-consumer, out-of-order queuing mechanism that can scale to at least hundreds of threads. We integrate XQueue with OpenMP, a widely used parallel programming interface for shared-memory architectures. This enables extremely fine-grained parallelism for native OpenMP applications, which can run unmodified simply by linking against our runtime library.
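
To illustrate the kind of workload this targets, here is a minimal sketch of a naive task-parallel Fibonacci in standard OpenMP (C++). Each task does almost no work, so the cost of creating and scheduling tasks dominates. Nothing in it is XTASK-specific: the names fib and parallel_fib are illustrative, and the build and link flags for the XTASK runtime are not given on this page, so they are left out.

```cpp
#include <cassert>
#include <cstdint>

// Naive Fibonacci that spawns one OpenMP task per recursive call.
// The work per task is tiny, so runtime overhead dominates.
std::uint64_t fib(int n) {
    assert(n >= 0);
    if (n < 2) return static_cast<std::uint64_t>(n);

    std::uint64_t a = 0, b = 0;
    #pragma omp task shared(a) firstprivate(n)
    a = fib(n - 1);
    #pragma omp task shared(b) firstprivate(n)
    b = fib(n - 2);
    #pragma omp taskwait  // wait for both child tasks before combining
    return a + b;
}

// Entry point: a single thread seeds the task tree and the whole team
// of threads executes the generated tasks.
std::uint64_t parallel_fib(int n) {
    std::uint64_t result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib(n);
    return result;
}
```

Calling parallel_fib(20) returns 6765 whichever OpenMP runtime executes the tasks; only the time spent on task management should differ.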

Methodology / Approach

We have built a lock-less concurrent framework for parallel runtime systems that enables fine-grained task parallelism by reducing the overheads of the underlying runtime. We have benchmarked several parallel applications and demonstrated the potential performance improvements on modern architectures with hundreds of cores. Load balancing is extremely important for parallel applications, as imbalances quickly lead to sub-optimal execution times. The framework currently uses a static round-robin load-balancing strategy to distribute tasks among threads. For dynamic load balancing, we are investigating a promising lock-less work-stealing algorithm, which we expect to yield further performance improvements on real applications.
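
This page does not describe the internals of the queue or the round-robin distribution, so the following is only a sketch under stated assumptions, not the framework's implementation. It assumes the multiple-producer, multiple-consumer queue can be composed from single-producer, single-consumer ring buffers, one per (consumer, producer) pair, so that no lock and no atomic read-modify-write is needed. All names (SpscQueue, RoundRobinGrid, submit, drain) are hypothetical, as is the choice to run a task inline when the target queue is full.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

using Task = std::function<void()>;

// Bounded single-producer, single-consumer ring buffer. It needs no locks and
// no atomic read-modify-write, only acquire/release loads and stores.
class SpscQueue {
public:
    explicit SpscQueue(std::size_t capacity) : slots_(capacity + 1) {
        assert(capacity > 0);
    }

    // Producer side. Moves from `t` only on success; returns false when full.
    bool try_push(Task& t) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % slots_.size();
        if (next == head_.load(std::memory_order_acquire)) return false;
        slots_[tail] = std::move(t);
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false when empty.
    bool try_pop(Task& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;
        out = std::move(slots_[head]);
        head_.store((head + 1) % slots_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<Task> slots_;
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

// Hypothetical MPMC layer (an assumption, not the XTASK design as documented
// here): one SPSC queue per (consumer, producer) pair.
class RoundRobinGrid {
public:
    RoundRobinGrid(std::size_t producers, std::size_t consumers, std::size_t capacity)
        : producers_(producers), consumers_(consumers), cursor_(producers, 0) {
        assert(producers > 0 && consumers > 0);
        for (std::size_t i = 0; i < producers * consumers; ++i)
            queues_.push_back(std::make_unique<SpscQueue>(capacity));
    }

    // Call only from producer `p`. Static round-robin: each submission goes to
    // the next consumer in turn. If that queue is full, run the task inline.
    void submit(std::size_t p, Task t) {
        const std::size_t c = cursor_[p];
        cursor_[p] = (c + 1) % consumers_;
        if (!queue(c, p).try_push(t)) t();
    }

    // Call only from consumer `c`. Runs every task currently queued for `c`,
    // scanning producers in turn, and returns how many tasks ran.
    std::size_t drain(std::size_t c) {
        std::size_t ran = 0;
        Task t;
        for (std::size_t p = 0; p < producers_; ++p)
            while (queue(c, p).try_pop(t)) { t(); ++ran; }
        return ran;
    }

private:
    SpscQueue& queue(std::size_t c, std::size_t p) { return *queues_[c * producers_ + p]; }

    std::size_t producers_, consumers_;
    std::vector<std::size_t> cursor_;
    std::vector<std::unique_ptr<SpscQueue>> queues_;
};
```

In this sketch each thread would call submit with its own producer index and drain with its own consumer index. Because a consumer scans one queue per producer, tasks from different producers are not executed in a global first-in, first-out order, which is one possible reading of "out-of-order". Static round-robin keeps submission cheap but cannot correct an imbalance once tasks are queued, which is the gap work stealing is meant to close.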

Technologies Used

Intel oneAPI

Intel VTune Profiler

Intel Advisor

Intel OpenMP

Intel Compilers

Repository

https://gitlab.com/pnookala/llvm-openmp.git
