The use and exploration of GPUs in federated learning
Huiting Hou
Beijing, Beijing
This project presents exploration results on using GPUs to accelerate homomorphic encryption computation, analyses the feasibility of GPU-accelerated federated learning, and introduces several optimisation schemes.
Project status: Concept
Overview / Usage
When AI technology is used, the parties involved, whether organisations or individual devices, each accumulate isolated data silos, and there is no good way for data to flow freely between them. The first obstacle to such data flow is inevitable data protectionism, which can be addressed by technical means; the solution proposed here is federated learning. Federated learning allows different organisations, such as hospitals, banks, companies, or terminal devices, to form a federation and jointly unlock the value of their data. However, federated learning inevitably relies on complex cryptographic operations such as homomorphic encryption, so its efficiency is greatly reduced compared with ordinary machine learning. This project presents exploration results on using GPUs to accelerate homomorphic computation, analyses the feasibility of GPU-accelerated federated learning, and introduces three optimisation schemes.
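The homomorphic scheme typically used for this kind of secure aggregation is Paillier encryption, whose additive property lets parties sum values without decrypting them. As a minimal illustration only (not the project's actual implementation), the sketch below uses deliberately tiny primes; real deployments use 1024-bit or larger keys, which is exactly why GPU acceleration matters:

```python
import random
from math import gcd

def L(x, n):                             # the "L function" L(x) = (x - 1) / n
    return (x - 1) // n

# Toy Paillier keypair. The primes are hypothetical, chosen only so the
# example runs instantly; production keys use 1024-bit+ primes.
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1                                # standard simplification g = n + 1
lam = (p - 1) * (q - 1)                  # a multiple of lcm(p-1, q-1) suffices
mu = pow(L(pow(g, lam, n2), n), -1, n)   # decryption constant (Python 3.8+)

def encrypt(m):
    r = random.randrange(2, n)           # fresh randomness per ciphertext
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2), n) * mu) % n

c1, c2 = encrypt(17), encrypt(25)
# Additive homomorphism: multiplying ciphertexts adds the plaintexts.
assert decrypt((c1 * c2) % n2) == 42
# Scalar multiplication: raising a ciphertext to a power scales the plaintext.
assert decrypt(pow(c1, 3, n2)) == 51
```

Note how every operation bottoms out in modular exponentiation over the large modulus n²; these exponentiations are the workload the rest of this project targets.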
Methodology / Approach
We carried out a feasibility analysis of using GPUs to accelerate federated learning computation and made four main observations. First, in the encryption, decryption, and ciphertext computations involved in federated learning, the calculations on different data items are essentially independent of one another; the workload is therefore highly parallel, which is exactly the kind of work GPUs accelerate well. Second, the formulas computed in federated learning are not complicated, but they are executed a huge number of times; this pattern of repeated lightweight calculation also suits GPU acceleration. Third, in federated learning the data I/O time is less than 0.1% of the computation time, so the task is compute-bound, and GPUs are well suited to accelerating compute-intensive tasks. Finally, the data in federated learning is produced in batches and the volume is huge, matching the batch style of computation at which GPUs excel.
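The first observation, that per-item ciphertext operations are mutually independent, can be sketched as follows. The thread pool here merely demonstrates independence on a CPU (the modulus and batch are hypothetical); on a GPU each exponentiation would map to its own stream-processor work-item:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical modulus and batch, standing in for Paillier ciphertext work.
MOD = 2**127 - 1
batch = list(range(2, 1002))             # 1000 independent inputs

def modexp(b):
    # The same lightweight formula, repeated once per data item.
    return pow(b, 65537, MOD)

# Because no item reads another item's state, the batch can be computed
# in any order or fully in parallel without changing the result.
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(modexp, batch))

assert parallel == [modexp(b) for b in batch]
```

The equality check confirms that parallel and sequential execution agree, which is the property that makes the workload embarrassingly parallel and hence GPU-friendly.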
Although federated learning has many properties that make it suitable for GPU acceleration, it certainly also faces challenges; we summarise three. The first is that federated learning must operate on large integers of 1024 bits, 2048 bits, or even more, but a GPU's stream processors do not support such large integers directly in hardware. The second is that federated learning involves a huge number of modular exponentiations, which GPUs are not good at: performing modular exponentiation directly on a GPU is very costly. The third is that federated learning computation must hold a large volume of intermediate results, but, owing to cost and power constraints, GPU memory is very limited.
For the first challenge, the solution we adopted is element-level parallelism based on the idea of divide and conquer. A large-integer multiplication can be recursively decomposed into many small-integer multiplications computed in parallel, so that the GPU can execute the small multiplications directly in hardware. The other operations involved in federated learning can be parallelised at the element level in a similar way.
For the second challenge, we mainly use the square-and-multiply algorithm combined with the Montgomery algorithm to reduce the high cost of modular exponentiation. The advantage of square-and-multiply is that it reduces the overall complexity to O(N) modular multiplications, where N is the bit length of the exponent, and keeps every intermediate result bounded by the modulus. Its disadvantage is that it still performs up to 2N modulo operations, and for a GPU the modulo operation is very expensive. To remove this cost we further introduce Montgomery modular multiplication, whose key feature is that it computes a modular product without ever performing a modulo (division) operation, thus completely avoiding the costly reductions.
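The combination can be sketched as follows. This is a CPU-side illustration only, assuming an odd modulus (Paillier moduli n² are odd); a GPU kernel would run the same loop per work-item, with each Montgomery step reduced to multiplies, shifts, and masks:

```python
def montgomery_exp(base, exp, N):
    """Square-and-multiply with Montgomery reduction: every 'mod N' is
    replaced by a mask (mod R) and a shift (divide by R), R = 2^k > N."""
    k = N.bit_length()
    R = 1 << k
    mask = R - 1
    n_inv = pow(R - N, -1, R)            # N' with N*N' = -1 (mod R); N must be odd

    def mont_mul(a, b):                  # returns a*b*R^-1 mod N, no division by N
        t = a * b
        m = (t * n_inv) & mask           # "mod R" is just a bitmask
        u = (t + m * N) >> k             # "divide by R" is just a shift
        return u - N if u >= N else u    # single conditional subtraction

    result = R % N                       # 1 in Montgomery form
    x = (base % N) * R % N               # base in Montgomery form
    e = exp
    while e:                             # square-and-multiply main loop
        if e & 1:
            result = mont_mul(result, x)
        x = mont_mul(x, x)
        e >>= 1
    return mont_mul(result, 1)           # convert back out of Montgomery form

N = 2**255 - 19                          # any odd modulus works
assert montgomery_exp(5, 12345, N) == pow(5, 12345, N)
```

Inside the loop there is no `%` by N at all; the only reductions are bitwise operations on powers of two, which is what makes the scheme cheap on hardware without a native large-integer divider.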
For the third challenge, we use the Chinese remainder theorem (CRT). It avoids using large amounts of GPU memory to buffer intermediate results; in addition, it reduces a modular exponentiation on 2N-bit numbers to modular exponentiations on N-bit numbers, greatly cutting the amount of computation.
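The idea can be illustrated with a generic modular exponentiation split across two primes; Paillier decryption applies the same trick with p² and q². The Mersenne primes below are hypothetical values chosen for brevity, and the sketch assumes the base is coprime to the modulus:

```python
def crt_powmod(c, d, p, q):
    """Compute c^d mod (p*q) as two half-size exponentiations plus a cheap
    recombination. Each half is independent and needs only N-bit buffers,
    halving both the working-set size and the operand width."""
    mp = pow(c % p, d % (p - 1), p)      # exponent reduced via Fermat's little theorem
    mq = pow(c % q, d % (q - 1), q)
    q_inv = pow(q, -1, p)                # precomputable per key (Python 3.8+)
    h = (q_inv * (mp - mq)) % p          # Garner's recombination step
    return mq + h * q                    # unique result below p*q

p, q = 2**61 - 1, 2**89 - 1              # hypothetical Mersenne primes
c, d = 123456789, 65537
assert crt_powmod(c, d, p, q) == pow(c, d, p * q)
```

Both the exponent and every intermediate value now fit in half the bits, so a 2048-bit exponentiation becomes two independent 1024-bit ones, which also parallelise naturally across GPU work-items.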
Finally, consider the evaluation of GPU-accelerated federated learning computation after applying the three methods above. We compared four kinds of ciphertext computation involved in federated learning: homomorphic encryption, homomorphic decryption, ciphertext multiplication, and ciphertext addition. For the more complex homomorphic encryption and decryption we obtained a speedup of almost 6x, while for the relatively simple ciphertext multiplication and ciphertext addition the speedups reached tens and hundreds of times respectively. These are only preliminary results; we believe there is considerably more room for GPU acceleration of federated learning computation, and it is worth exploring together.