eXTreme fine-grAined concurrent taSKing framework
Poornima Nookala
XTASK enables extreme fine-grained parallelism across modern many-core architectures with hundreds of cores by implementing a novel lock-less, multiple-producer, multiple-consumer, out-of-order queuing mechanism for managing parallel tasks.
Project status: Under Development
Groups
Student Developers for oneAPI
Intel Technologies
oneAPI
Overview / Usage
Supporting fine-grained task parallelism is a significant challenge on hardware platforms with rapidly increasing core counts. Concurrent data structures rely on hardware primitives to synchronize access to shared memory across many cores and threads of execution, and existing synchronization mechanisms scale poorly on modern architectures at high concurrency. Any parallel runtime that aims to support fine-grained parallelism across many threads must therefore be wary of traditional synchronization techniques. Concurrent queues, a crucial building block of parallel runtime systems, are typically built on these same mechanisms. To reduce synchronization overhead in parallel runtime systems and make smaller task granularities practical, we propose XQueue, a novel lock-less, multiple-producer, multiple-consumer, out-of-order queuing mechanism that can scale to at least hundreds of threads. We integrate XQueue with OpenMP, a widely used parallel programming interface for shared-memory architectures. This enables extremely fine-grained parallelism for native OpenMP applications, which run unmodified simply by linking against our runtime library.
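To make the idea of a lock-less, out-of-order, multiple-producer, multiple-consumer queue more concrete, the sketch below shows one way such a queue could be composed. It is an illustration under stated assumptions, not a description of how XQueue is implemented. The assumption is that every (producer, consumer) pair gets its own single-producer single-consumer ring buffer, so no slot is ever written by two threads and plain atomic loads and stores are enough. The names SpscRing and MatrixQueue are hypothetical.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Lock-less single-producer single-consumer ring buffer (holds Capacity - 1 items).
// Only the producer writes tail_ and only the consumer writes head_, so no
// locks or compare-and-swap operations are needed.
// T must be default-constructible and copyable.
template <typename T, std::size_t Capacity>
class SpscRing {
public:
    bool push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        slots_[tail] = value;
        tail_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = slots_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);  // free the slot
        return true;
    }

private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

// Hypothetical MPMC queue assembled from an n x n matrix of SPSC rings.
// Ring (p, c) is written only by thread p and read only by thread c.
// Ordering is preserved within one ring but not across rings, which is
// what makes the queue "out-of-order".
template <typename T, std::size_t Capacity>
class MatrixQueue {
public:
    explicit MatrixQueue(std::size_t num_threads)
        : n_(num_threads),
          rings_(num_threads * num_threads),
          push_cursor_(num_threads, 0),
          pop_cursor_(num_threads, 0) {}

    // Must be called only by thread `producer`.
    bool push(std::size_t producer, const T& task) {
        for (std::size_t attempt = 0; attempt < n_; ++attempt) {
            const std::size_t consumer = (push_cursor_[producer] + attempt) % n_;
            if (rings_[producer * n_ + consumer].push(task)) {
                push_cursor_[producer] = (consumer + 1) % n_;  // rotate target
                return true;
            }
        }
        return false;  // every ring owned by this producer is full
    }

    // Must be called only by thread `consumer`.
    bool pop(std::size_t consumer, T& out) {
        for (std::size_t attempt = 0; attempt < n_; ++attempt) {
            const std::size_t producer = (pop_cursor_[consumer] + attempt) % n_;
            if (rings_[producer * n_ + consumer].pop(out)) {
                pop_cursor_[consumer] = (producer + 1) % n_;  // rotate source
                return true;
            }
        }
        return false;  // nothing is currently queued for this consumer
    }

private:
    std::size_t n_;
    std::vector<SpscRing<T, Capacity>> rings_;
    std::vector<std::size_t> push_cursor_;  // entry i is touched only by thread i
    std::vector<std::size_t> pop_cursor_;   // entry i is touched only by thread i
};
```

In this sketch each thread passes its own index to push and pop. The cost of avoiding contended atomic operations is memory that grows with the square of the thread count, plus the loss of global FIFO ordering. A production version would also pad the cursors and rings to separate cache lines to avoid false sharing.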
Methodology / Approach
We have built a lock-less concurrent framework for parallel runtime systems that enables fine-grained task parallelism by reducing the overheads of the underlying runtime. We have benchmarked several parallel applications and demonstrated the potential performance improvements on modern architectures with hundreds of cores. Load balancing is extremely important for parallel applications, as imbalances quickly lead to sub-optimal execution times. The framework currently uses a static round-robin load-balancing strategy to distribute tasks among threads. For dynamic work stealing, we are investigating a promising lock-less work-stealing algorithm, which should provide further performance improvements on real applications.
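The static round-robin strategy mentioned above can be sketched in a few lines. This is a minimal illustration, assuming tasks are identified by index and task i is sent to worker i modulo the worker count. The helper names round_robin_assign and worker_loads are hypothetical and are not part of the framework.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: for each worker, list the task indices it receives
// when task i is statically assigned to worker (i % num_workers).
std::vector<std::vector<std::size_t>> round_robin_assign(std::size_t num_tasks,
                                                         std::size_t num_workers) {
    std::vector<std::vector<std::size_t>> plan(num_workers);
    if (num_workers == 0) return plan;  // nothing to assign to
    for (std::size_t task = 0; task < num_tasks; ++task) {
        plan[task % num_workers].push_back(task);
    }
    return plan;
}

// Hypothetical helper: total cost each worker ends up with under the same
// round-robin assignment, given a per-task cost in arbitrary units.
std::vector<long> worker_loads(const std::vector<long>& task_costs,
                               std::size_t num_workers) {
    std::vector<long> loads(num_workers, 0);
    if (num_workers == 0) return loads;
    for (std::size_t task = 0; task < task_costs.size(); ++task) {
        loads[task % num_workers] += task_costs[task];
    }
    return loads;
}
```

Round-robin needs no shared state and balances task counts evenly, but it ignores task cost. With costs {8, 1, 1, 1, 8, 1, 1, 1} on four workers, worker 0 receives both expensive tasks and ends up with a load of 16 while every other worker has 2. This is the kind of imbalance that dynamic work stealing is meant to correct.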
Technologies Used
Intel oneAPI
Intel VTune Profiler
Intel Advisor
Intel OpenMP
Intel Compilers