Gal Oren
Haifa, Haifa District
This project includes the implementation of In-NVRAM Exact State Reconstruction (ESR) for the PCG solver, as we describe in https://arxiv.org/pdf/2204.11584.pdf. In this work we also plan to research recoverability (with NVRAM) of concurrent applications using OpenMP.
Project status: Under Development
Intel Technologies
Optane
Iterative linear solvers are core kernels in scientific applications. Exact State Reconstruction (ESR) techniques were proposed over the last decade and rely on RAM to create redundancies for certain state variables of the solver. Our research investigates how NVRAM can be used to reduce the memory and time overheads of what we call in-RAM ESR. This project is a result of that research: it aims to substantially improve the performance of in-RAM ESR, given the technological changes in the decade since it was first introduced, and to eliminate its main problems, namely an extended memory footprint and a constant surge of network traffic. Our work rests on three pillars: (1) the recently enabled capability of direct access (DAX) to NVRAM, (2) access to such memory with MPI One-Sided Communication (OSC) over RDMA, and (3) the observation that these two capabilities make it possible to keep all of the qualities of the original in-RAM ESR while persisting just one copy of the recovery data every persistence cycle instead of many redundancies. The result is the enhanced in-NVRAM ESR, which, instead of populating RAM with many redundant copies for fault tolerance, sends just one copy through RDMA directly to the persistent NVRAM via DAX. Accessing byte-addressable NVRAM directly, without the latency of moving data over the I/O bus, with performance comparable to RAM and only a small overhead, yields a much more advanced ESR mechanism without compromising data and recovery consistency.
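The idea of persisting a single copy of recovery data per persistence cycle can be illustrated with a small, self-contained sketch. It is not the project's implementation: it is plain Python, a memory-mapped file stands in for DAX-mapped NVRAM, `flush()` stands in for a persist barrier, and there is no MPI or RDMA. It also assumes a simplified choice of recovery data (the iterate `x` and search direction `p`, with a Jacobi preconditioner), which may differ from the scheme in the paper. The names `open_region`, `persist`, `recover`, and `pcg`, and the parameters `period`, `fail_at`, and `resume`, are hypothetical and exist only for this illustration.

```python
import mmap
import os
import struct


def open_region(path, n):
    """Map a file holding one copy of (iteration, x, p); stand-in for NVRAM."""
    size = 8 + 16 * n  # one int64 plus 2*n float64 values
    if not os.path.exists(path) or os.path.getsize(path) != size:
        with open(path, "wb") as f:
            f.truncate(size)
    with open(path, "r+b") as f:
        return mmap.mmap(f.fileno(), size)


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def persist(region, it, x, p):
    """Overwrite the single recovery copy and flush it to the backing file."""
    region.seek(0)
    region.write(struct.pack(f"<q{2 * len(x)}d", it, *x, *p))
    region.flush()  # assumption: plays the role of a persist barrier


def recover(region, n):
    """Read back the last persisted (iteration, x, p)."""
    region.seek(0)
    data = struct.unpack(f"<q{2 * n}d", region.read(8 + 16 * n))
    return data[0], list(data[1:n + 1]), list(data[n + 1:])


def pcg(A, b, region, tol=1e-10, max_it=200, period=2,
        fail_at=None, resume=False):
    """Jacobi-preconditioned CG that persists (it, x, p) every `period` steps."""
    n = len(b)
    minv = [1.0 / A[i][i] for i in range(n)]
    if resume:
        # Reconstruct the rest of the solver state from the persisted copy:
        # r = b - A x, z = M^{-1} r; p is read back as stored.
        it, x, p = recover(region, n)
        r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
        z = [m * ri for m, ri in zip(minv, r)]
    else:
        it, x, r = 0, [0.0] * n, list(b)
        z = [m * ri for m, ri in zip(minv, r)]
        p = list(z)
    rz = dot(r, z)
    while it < max_it and dot(r, r) ** 0.5 > tol:
        if it % period == 0:
            persist(region, it, x, p)
        if fail_at is not None and it == fail_at:
            raise RuntimeError("simulated node failure")
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [m * ri for m, ri in zip(minv, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
        it += 1
    return x, it
```

To try it, call `pcg(A, b, region, fail_at=3)` on a small symmetric positive definite system, catch the simulated failure, reopen the same file with `open_region`, and call `pcg(A, b, region, resume=True)`. The solver then continues from the last persisted iteration instead of restarting. The sketch overwrites the recovery copy in place and so ignores torn writes, which a real persistence layer would have to handle.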
We implement in-NVRAM ESR with our new library for MPI One-Sided Communication (OSC) over RDMA in an NVRAM setting, and study two possible NVRAM placement architectures:
In the PRD sub-cluster architecture, we assume RAID between nodes to provide fault tolerance against errors in the sub-cluster; otherwise, each node of the sub-cluster is a single point of failure. We stress that while in-RAM ESR's data transfer grows quadratically with the cluster size, the growth in RAID writes is linear and depends on the RAID level.
This project is a result of our research into how to substantially improve the performance of in-RAM Exact State Reconstruction (ESR), given the technological changes in the decade since it was first introduced, and how to eliminate its main problems, namely an extended memory footprint and a constant surge of network traffic. Our work rests on three pillars: (1) the recently enabled capability of direct access (DAX) to NVRAM, specifically through the PMDK library, (2) access to such memory with MPI One-Sided Communication (OSC) over RDMA, and (3) the observation that these two capabilities make it possible to keep all of the qualities of the original in-RAM ESR while persisting just one copy of the recovery data every persistence cycle instead of many redundancies.
https://github.com/Scientific-Computing-Lab-NRCN/In-NVRAM-ESR.git