Matrix computation is the core component of many scientific computing and machine learning methods.Effectively improving the performance of matrix computation is of great significance to the development of high-performance scientific computing systems or large data processing systems.Graphic Processor Unit(GPU)is an important parallel computing device.It can run a large number of threads at the same time.At the same time,it has great advantages in floating-point operation.So it has become the most commonly used parallel computing coprocessor.It is also an important computing platform for high performance matrix computation.However,parallel acceleration of matrix computation based on GPU still faces a challenge: because the parallel computing model of GPU is single instruction and multiple data,many matrix computation problems inevitably lead to data dependence problems.Therefore,how to solve the problem of parallel scheduling and computing of multi-data under data dependence is a difficult problem.Focusing on this core problem,this thesis studies the matrix computation model and algorithm design in scientific computing and machine learning respectively.The main research contents include optimization of dense matrix computation,solution of symmetric positive definite linear equations and matrix factorization with missing values.The main work of this dissertation are as follows:(1)For the optimization problem of dense matrix computation,this dissertation proposes a set of processor adaptive selection and optimization scheme,which can allocate the appropriate matrix computation to the appropriate processor.In this dissertation,five common dense matrix computations,such as vector inner product,matrix-vector multiplication,solving triangular vector equation,matrix-matrix multiplication,and solving triangular matrix equation,are studied by parallel analysis and mathematical modeling methods.Combining the hardware architecture features of CPU and GPU,a complete processor selection scheme is obtained.Finally,parallel analysis conclusions and mathematical models are verified on different sizes of vectors and matrices.Experiments show that the processor selection scheme given in this dissertation can effectively select the appropriate processor.Based on the processor selection scheme obtained in this dissertation,it can be used to develop adaptive high performance dense matrix computation program.(2)For the problem of solving sparse symmetric positive definite linear equations,this dissertation designs a preconditioning method that can efficiently utilize GPU-based matrix block preconditioning method based on complete inverse.The problem of sparse symmetric positive definite linear equations generally requires the use of preconditioned conjugate gradient method,and the most important problem in preconditioned conjugate gradient method is to choose a suitable preconditioning method.This dissertation first analyzes the shortcomings of the existing preconditioning methods on the GPU,and then designs the block preconditioner with direct inverse using the characteristic that GPU is suitable for batch dense matrix computation.The block preconditioner with direct inverse,the GPU implements and the performance analyisi of various methods on the GPU through mathematical models are given.Finally,experiments are performed on sparse symmetric positive definite linear equations with different element distributions.The block preconditioner with direct inverse is a reliable,stable and efficient preconditioning method,which is the most suitable method in the case where the element distribution of the equations cannot be predicted.(3)For the problem of computational resource waste in matrix factorization with missing values,this dissertation designs a new parallel task allocation method.The new task allocation method can remove the occupation of thread blocks by blank data blocks.Matrix factorization with missing values is mainly used for recommendation systems.However,existing GPU-based matrix factorization methods cause a large number of GPU threads to be in a state of no data execution when the number of users differs greatly from the number of items,resulting in inefficiency.This dissertation mainly analyzes the defects of the existing matrix partitioning method when the number of users differs greatly from the number of items,and designs a new matrix partitioning method and the GPU task allocation method.The new task allocation method can improve computational efficiency.Finally,experimental verification was performed on four different data sets.Experiments show that the new parallel task allocation method is more efficient than the existing methods.Based on the computational task allocation method proposed in this dissertation,the GPU can be efficiently used to solve the matrix factorization with missing values.(4)For the problem of unbalanced computing resource allocation in matrix factorization with missing values,this dissertation designs a vector parallel-based implementation method,which can make full use of the inherent parallelism of vector computing.This dissertation compares the different implementation between the existing stochastic gradient descent method and the gradient descent method of other optimization problems,carefully analyzes the performance difference between the two parallel methods,gives a new data partitioning and parallel implementation method,and finally verify the experiment on four different data sets.Experiments show that the implementation method based on vector parallel can improve the parallel computing efficiency.Based on the vector parallel implementation method proposed in this dissertation,the amount of data preprocessing computation can be reduced.At the same time,in the case of the same computing resources,with the new method GPU can solve the matrix factorization with missing values more efficiently. |