Matrix multiplication
YuanGuang Chen
- 0 Collaborators
Optimize matrix multiplication with oneAPI. This project improves the efficiency of matrix multiplication by implementing blocked matrix multiplication through parallel computing.
Project status: Under Development
Overview / Usage
The general form of matrix multiplication is C = A*B, where the dimensions of the three matrices A, B, and C are M*K, K*N, and M*N, respectively, and all three matrices hold single-precision floating-point data. Serial computation performs 2*M*N*K operations: the two outer loops traverse the M rows of matrix A and the N columns of matrix B, the inner loop of length K walks the elements of one row of A and one column of B, and the factor 2 accounts for the one multiplication and one addition at the innermost level of the three-level loop. When M, N, and K are large, the number of operations and the number of memory reads for serial matrix multiplication grow sharply, and the computation eventually takes a huge amount of time.
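The serial three-level loop described above can be sketched in plain C++ (the row-major flat-array layout here is an illustrative choice, not necessarily the project's):

```cpp
#include <vector>
#include <cstddef>

// Serial matrix multiplication: C (M x N) = A (M x K) * B (K x N).
// Matrices are stored row-major in flat vectors of single-precision floats.
std::vector<float> matmul_serial(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 std::size_t M, std::size_t K, std::size_t N) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m)           // row of A
        for (std::size_t n = 0; n < N; ++n) {     // column of B
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)   // 1 multiply + 1 add per step
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
    return C;
}
```

The innermost statement executes M*N*K times and does one multiplication plus one addition each time, which is where the 2*M*N*K operation count comes from.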
To improve the efficiency of matrix multiplication, this experiment divides the result matrix C into blocks, and the GPU computes each work block in parallel. In addition, by changing the way data is read, the number of memory accesses is reduced, which shortens access time and further improves the efficiency of matrix multiplication.
Methodology / Approach
This project focuses on implementing the algorithm for matrix multiplication using DPC++, and studying parallel methods for dividing subtasks and setting cache blocks.
The general form of matrix multiplication is C = A*B, where the dimensions of the three matrices A, B, and C are M*K, K*N, and M*N, respectively. Serial computation performs 2*M*N*K operations: the two outer loops traverse the M rows of matrix A and the N columns of matrix B, the inner loop of length K walks the corresponding row of A and column of B, and the factor 2 accounts for the one multiplication and one addition at the innermost level of the three-level loop.
To reduce the computational cost of matrix multiplication, this experiment divides the result matrix C into several blocks of the same size; computing each work block uses only the corresponding rows of matrix A and columns of matrix B. The GPU computes the work blocks in parallel, yielding a parallel implementation of blocked matrix multiplication and improving its efficiency.
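The block decomposition of C can be sketched in plain C++; in the actual DPC++ implementation each (block row, block column) pair would be dispatched as an independent unit of GPU work, so the block sizes `BM` and `BN` below are illustrative assumptions:

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>

// Blocked matrix multiplication: C (M x N) = A (M x K) * B (K x N).
// C is partitioned into BM x BN blocks; each block depends only on the
// matching rows of A and columns of B, so the blocks can be computed
// independently (e.g. by parallel GPU work items).
std::vector<float> matmul_blocked(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  std::size_t M, std::size_t K, std::size_t N,
                                  std::size_t BM, std::size_t BN) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t bm = 0; bm < M; bm += BM)      // block row of C
        for (std::size_t bn = 0; bn < N; bn += BN)  // block column of C
            // This (bm, bn) block reads rows [bm, bm+BM) of A
            // and columns [bn, bn+BN) of B.
            for (std::size_t m = bm; m < std::min(bm + BM, M); ++m)
                for (std::size_t n = bn; n < std::min(bn + BN, N); ++n) {
                    float acc = 0.0f;
                    for (std::size_t k = 0; k < K; ++k)
                        acc += A[m * K + k] * B[k * N + n];
                    C[m * N + n] = acc;
                }
    return C;
}
```

Because no block writes outside its own region of C, the two outer block loops are the part that a GPU executes in parallel.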
Blocking the computation by itself does not change the number of memory reads; only the computation time is reduced, through GPU parallelism. Therefore a cache block of size cache_block_x * cache_block_y is set, and within each work block the computation is further divided according to the cache block size. In a single loop pass, caching a cache-block-sized piece of matrix A and matrix B is sufficient for the computation, which reduces the number of iterations over memory and hence the number of memory accesses, achieving cache optimization for blocked matrix multiplication.
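One common way to realize this cache blocking, sketched here in plain C++ as an assumption about the kernel structure: the K dimension is also tiled, and for each tile a small piece of A and B is staged into local buffers (which would correspond to GPU local/shared memory in DPC++) before being reused across the whole tile. The tile sizes `TM`, `TN`, `TK` are illustrative.

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>

// Cache-blocked multiplication: C (M x N) = A (M x K) * B (K x N).
// For each K tile, a TM x TK piece of A and a TK x TN piece of B are
// copied into small local buffers, so each element is fetched from the
// large arrays once per tile instead of once per inner-loop iteration.
std::vector<float> matmul_cached(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 std::size_t M, std::size_t K, std::size_t N,
                                 std::size_t TM, std::size_t TN, std::size_t TK) {
    std::vector<float> C(M * N, 0.0f);
    std::vector<float> Atile(TM * TK), Btile(TK * TN);  // "cache" buffers
    for (std::size_t bm = 0; bm < M; bm += TM)
        for (std::size_t bn = 0; bn < N; bn += TN)
            for (std::size_t bk = 0; bk < K; bk += TK) {
                const std::size_t mEnd = std::min(bm + TM, M);
                const std::size_t nEnd = std::min(bn + TN, N);
                const std::size_t kEnd = std::min(bk + TK, K);
                // Stage the current tiles of A and B into the local buffers.
                for (std::size_t m = bm; m < mEnd; ++m)
                    for (std::size_t k = bk; k < kEnd; ++k)
                        Atile[(m - bm) * TK + (k - bk)] = A[m * K + k];
                for (std::size_t k = bk; k < kEnd; ++k)
                    for (std::size_t n = bn; n < nEnd; ++n)
                        Btile[(k - bk) * TN + (n - bn)] = B[k * N + n];
                // Accumulate the partial products for this K tile from the buffers.
                for (std::size_t m = bm; m < mEnd; ++m)
                    for (std::size_t n = bn; n < nEnd; ++n) {
                        float acc = 0.0f;
                        for (std::size_t k = bk; k < kEnd; ++k)
                            acc += Atile[(m - bm) * TK + (k - bk)] *
                                   Btile[(k - bk) * TN + (n - bn)];
                        C[m * N + n] += acc;
                    }
            }
    return C;
}
```

Each element of A and B is now loaded from the large arrays once per tile rather than once per multiply-add, which is the reduction in memory accesses the section describes.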