| In complex system simulation, a large number of small and medium-sized nonlinear simulation model units must be solved approximately by linearization. For dense matrices that are not positive definite, LU decomposition is usually used. This decomposition, which is based on Gaussian elimination, is computationally expensive and time-consuming, and it seriously limits simulation speed. The problem becomes more prominent as simulation scenarios grow increasingly complex. An efficient method for solving small and medium-sized systems of linear equations is therefore of great significance for advancing the simulation quickly. To address these problems, this thesis uses NVIDIA's CUDA programming technology to study parallel methods for solving small and medium-sized systems of linear equations, and it parallelizes and optimizes the traditional LU decomposition algorithm on the GPU. The main work and contributions are as follows. (1) This thesis designs and implements a high-performance batched parallel LU decomposition algorithm for small and medium-sized dense matrices. It presents eight versions of the batched parallel LU decomposition algorithm, which exploit the hardware through techniques such as data reorganization, coalesced global memory access, and local-variable caching. The algorithms effectively hide memory access latency and increase the proportion of time spent on useful computation. Experimental results show that, as the number of batches increases, performance grows approximately linearly, with a peak close to 450 GFLOPS. Compared with the NVIDIA cuBLAS library, the maximum speedup is close to 18. (2) Building on the LU decomposition algorithm, this thesis designs and implements an implicit parallel algorithm for solving large batches of small and medium-sized systems of linear equations. The algorithm uses a right-looking parallel back substitution process, which effectively accelerates the solution. Test results on specific cases show that the average solution speed of the algorithm is more than 3 times that of the batched linear equation solver API provided by the NVIDIA cuBLAS library. The algorithm supports implicit, real-time parallel solution of small simulation models at the scale of millions. |
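
The batched workflow summarized above can be illustrated with a small CPU-side reference sketch. It is not the thesis's GPU implementation: the interleaved storage layout (one plausible reading of "data reorganization" for coalesced access), the omission of pivoting, and all function names (`idx`, `pack`, `batched_lu_inplace`, `batched_solve_inplace`) are assumptions made for illustration. The outer loop over `b` stands in for the per-matrix parallelism that a GPU kernel would provide.

```python
def idx(i, j, n, batch, b):
    """Flat index of element (i, j) of matrix b in an interleaved layout.

    Assumed layout: the same element of consecutive matrices is adjacent in
    memory, so threads handling matrices b, b+1, ... would read neighbours.
    """
    return (i * n + j) * batch + b


def pack(mats):
    """Copy a list of n-by-n nested-list matrices into the interleaved layout."""
    batch, n = len(mats), len(mats[0])
    a = [0.0] * (n * n * batch)
    for b, m in enumerate(mats):
        for i in range(n):
            for j in range(n):
                a[idx(i, j, n, batch, b)] = float(m[i][j])
    return a


def batched_lu_inplace(a, n, batch):
    """Right-looking LU without pivoting; unit-lower L and U overwrite a."""
    for b in range(batch):  # on a GPU, each b would map to its own thread
        for k in range(n):
            pivot = a[idx(k, k, n, batch, b)]
            if pivot == 0.0:
                raise ZeroDivisionError("zero pivot: this sketch does not pivot")
            for i in range(k + 1, n):
                a[idx(i, k, n, batch, b)] /= pivot
                lik = a[idx(i, k, n, batch, b)]
                # Update the trailing submatrix to the right of column k.
                for j in range(k + 1, n):
                    a[idx(i, j, n, batch, b)] -= lik * a[idx(k, j, n, batch, b)]


def batched_solve_inplace(a, rhs, n, batch):
    """Solve L U x = rhs for every matrix; rhs[i * batch + b] is overwritten by x."""
    for b in range(batch):
        # Forward substitution with unit-lower L (column-oriented).
        for k in range(n):
            yk = rhs[k * batch + b]
            for i in range(k + 1, n):
                rhs[i * batch + b] -= a[idx(i, k, n, batch, b)] * yk
        # Right-looking back substitution: once x_k is known, immediately
        # subtract its contribution from all remaining right-hand-side entries.
        for k in range(n - 1, -1, -1):
            rhs[k * batch + b] /= a[idx(k, k, n, batch, b)]
            xk = rhs[k * batch + b]
            for i in range(k):
                rhs[i * batch + b] -= a[idx(i, k, n, batch, b)] * xk


# Usage: two 2-by-2 systems solved as one batch.
a = pack([[[4, 3], [6, 3]], [[2, 1], [1, 3]]])
rhs = [10.0, 3.0, 12.0, 5.0]  # interleaved: entry i of system b at i * 2 + b
batched_lu_inplace(a, 2, 2)
batched_solve_inplace(a, rhs, 2, 2)
```

Skipping pivoting keeps the sketch short, but it is only safe for matrices whose pivots stay nonzero and well scaled. A solver for general matrices that are not positive definite would normally add partial pivoting.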