
Design of a Solver for Large-Scale Sparse Linear Equations Based on CUDA

Posted on: 2014-10-02
Degree: Master
Type: Thesis
Country: China
Candidate: C J Wu
Full Text: PDF
GTID: 2268330401465884
Subject: Electronic and communication engineering
Abstract/Summary:
Solving large-scale linear systems of equations has long been an important task in scientific computation. With the development of graphics processing unit (GPU) hardware architectures, the GPU has evolved into a general-purpose computing device. Acting as a coprocessor to the CPU, the GPU takes over large-scale, data-intensive computing tasks; its computing power is roughly equivalent to that of a small cluster, but at lower power consumption and cost. In 2007, NVIDIA released a new parallel computing platform named CUDA, which reduced the difficulty of GPU-accelerated computing and made it far more convenient for researchers to apply the GPU to their problems. As a result, more and more scientific research fields now use GPU computing.

Based on CUDA, this thesis implements a solver for large-scale sparse linear equations on the GPU. The Compressed Sparse Row (CSR) format is used to store the large sparse matrix, and the conjugate gradient (CG) algorithm is adopted to solve the system. In every CG iteration, several scaled vector additions, dot products, and one sparse matrix-vector multiplication (SpMV) are computed on the GPU. The dot-product and SpMV kernels were written by the author, while the vector additions are performed with the CUBLAS library. Because the sparse matrix is symmetric positive definite, only its upper triangle is stored in CSR format. The SpMV is decomposed into separate multiplication and addition operations over the upper-triangular data, and the contribution of the lower triangle is obtained from that same upper-triangular CSR data. In the kernel design, atomic operations prevent different threads from reading or writing the same address at the same time. Since the GPU provides 6 GB of global memory in total, the large matrix is divided into several blocks, each smaller than the available global memory, so that the data processed by the GPU fits its capacity on every pass; this improves GPU utilization. In the dot-product operation, the reduction offset is changed to avoid shared-memory bank conflicts, which improves performance; this kernel outperforms the CUBLAS dot product. Finally, Jacobi preconditioning is added to the CG algorithm (JPCG) to improve the convergence rate.

Performance was analyzed on several sparse matrices of different sizes. Compared with CPU execution, the GPU achieves a speedup of 30 for the SpMV operation, 6 for the dot product, and up to 46 for the overall solver, showing that solving the equations on the GPU yields much better efficiency. The program also adapts automatically to sparse matrices of different sizes. The partitioning algorithm designed according to GPU capacity outperforms one that divides the sparse matrix into uniform blocks. SpMV using only the stored upper triangle is slower than SpMV over all elements of the matrix, but upper-triangle storage allows much larger matrices to be handled than storing all elements. Finally, the JPCG algorithm converges faster than the plain CG algorithm.
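
To make the storage scheme concrete, here is a minimal sketch of a CSR container and of how the upper triangle of a small symmetric matrix would be laid out; the struct and field names are illustrative, not taken from the thesis.

```cuda
// Minimal CSR container for the upper triangle of a symmetric sparse matrix.
// The struct and field names are illustrative, not taken from the thesis.
#include <vector>

struct CsrMatrix {
    int n;                       // matrix dimension
    std::vector<int>    rowPtr;  // size n+1: offset of each row in col/val
    std::vector<int>    col;     // column index of each stored nonzero
    std::vector<double> val;     // value of each stored nonzero
};

// Example: the upper triangle (with diagonal) of
//     | 4 1 0 |
//     | 1 5 2 |
//     | 0 2 6 |
// is stored as
//     rowPtr = {0, 2, 4, 5}
//     col    = {0, 1,  1, 2,  2}
//     val    = {4, 1,  5, 2,  6}
```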
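
The symmetric SpMV described above could be realized along the following lines: each thread walks one stored row of the upper triangle and mirrors every off-diagonal entry into the corresponding lower-triangle contribution, using atomicAdd in the spirit of the abstract's atomic operations. This is a sketch, not the thesis's actual kernel.

```cuda
// Sketch of y = A*x when only the upper triangle of a symmetric A is stored
// in CSR. One thread per row; each off-diagonal entry (i, j) also contributes
// val * x[i] to y[j], the mirrored lower-triangle term, accumulated with
// atomicAdd so that concurrent threads never corrupt the same address.
// Note: native double-precision atomicAdd requires compute capability 6.0+;
// older devices need a compare-and-swap emulation.
__global__ void spmvSymUpperCsr(int n,
                                const int*    __restrict__ rowPtr,
                                const int*    __restrict__ col,
                                const double* __restrict__ val,
                                const double* __restrict__ x,
                                double* y)  // y must be zeroed before launch
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double sum = 0.0;
    for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
        int j = col[k];
        sum += val[k] * x[j];                 // upper-triangle term of row i
        if (j != i)
            atomicAdd(&y[j], val[k] * x[i]);  // mirrored lower-triangle term
    }
    atomicAdd(&y[i], sum);  // other rows may also add into y[i]
}
```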
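
One plausible way to realize the capacity-driven partitioning is a greedy host-side pass over the row pointers, as sketched below; the function, the per-entry byte costs, and the default 6 GB budget are all assumptions for illustration, not code or figures from the thesis.

```cuda
// Hypothetical greedy partitioning: pack consecutive rows into a block until
// adding another row would exceed the memory budget. Byte costs assumed:
// 8-byte value + 4-byte column index per nonzero, 4 bytes per row pointer.
#include <cstddef>
#include <vector>

std::vector<int> partitionRows(const std::vector<int>& rowPtr,
                               std::size_t budgetBytes = 6ULL << 30)
{
    const std::size_t bytesPerNnz = sizeof(double) + sizeof(int);
    std::vector<int> splits{0};  // row indices where each block starts
    std::size_t used = 0;
    for (std::size_t i = 1; i < rowPtr.size(); ++i) {
        std::size_t rowBytes =
            (rowPtr[i] - rowPtr[i - 1]) * bytesPerNnz + sizeof(int);
        if (used + rowBytes > budgetBytes) {  // this row opens a new block
            splits.push_back(static_cast<int>(i) - 1);
            used = 0;
        }
        used += rowBytes;
    }
    splits.push_back(static_cast<int>(rowPtr.size()) - 1);  // sentinel: n
    return splits;
}
```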
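
The bank-conflict-free dot product is commonly implemented with a "sequential addressing" reduction, which matches the abstract's description of changing the offset during the reduce step. A minimal sketch, with an assumed 256-thread block size and illustrative kernel name:

```cuda
// Per-block partial dot product using a sequential-addressing reduction.
// Halving the stride each step ("changing the offset") keeps active threads
// reading consecutive shared-memory words, so no two threads in a warp hit
// the same bank.
__global__ void dotPartial(int n, const double* x, const double* y,
                           double* blockSums)
{
    __shared__ double cache[256];
    int tid = threadIdx.x;

    // Grid-stride loop so one launch covers a vector of any length.
    double sum = 0.0;
    for (int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        sum += x[i] * y[i];
    cache[tid] = sum;
    __syncthreads();

    // Conflict-free tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            cache[tid] += cache[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = cache[0];  // summed on host or in a 2nd pass
}
```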
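
Finally, the Jacobi preconditioner itself is inexpensive on the GPU, since applying M = diag(A) is one element-wise division per component, which is why it adds almost no cost per iteration while improving convergence for this symmetric positive definite system. A minimal kernel (names illustrative) might look like this:

```cuda
// Jacobi preconditioner apply: z = M^{-1} r with M = diag(A).
__global__ void applyJacobi(int n, const double* __restrict__ diagA,
                            const double* __restrict__ r, double* z)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = r[i] / diagA[i];
}
```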
Keywords/Search Tags: GPU, CUDA, Parallel Computing, CG, JPCG