
Design And Implementation Of Batch Matrix Multiplication Framework For High Performance Computing

Posted on: 2023-10-26    Degree: Master    Type: Thesis
Country: China    Candidate: R M Wang    Full Text: PDF
GTID: 2558306830491304    Subject: Computer Science and Technology
Abstract/Summary:
The Basic Linear Algebra Subprograms (BLAS) is an interface standard for a range of basic linear algebra functions that has long been used across scientific computing and industry, and it plays a vital role in modern scientific fields and industrial development. Classical BLAS implementations tend to perform well for large problems (large matrix or vector operations). However, in scenarios where each problem is small but the number of problems is large, underutilization of hardware resources can lead to significantly poor performance. The Batch Basic Linear Algebra Subprograms (Batch BLAS) concept and its initial implementations have therefore been proposed in recent years to address this emerging trend: multiple sub-problems are processed in parallel as a batch, resolving the performance dilemma that classical BLAS faces in such scenarios.

For batch matrix multiplication, current GPU-based BLAS libraries (cuBLAS, rocBLAS) support only fixed-size, uniform problem inputs, not variable-size matrix computation, which greatly limits their application scenarios. The aim of this thesis is to design and implement a GPU-based framework for batch variable-size matrix multiplication for high-performance computing, in order to address the poor performance of existing libraries in such scenarios. The research covers four areas: the analysis and optimization of the MAGMA vbatched routines; the design and implementation of a fine-grained variable-size matrix multiplication kernel function; the design and implementation of a variable-size matrix multiplication framework; and the application and optimization of HPCC DGEMM.

This thesis analyzes the input scenarios of batch variable-size matrix multiplication and presents the advantages and drawbacks of MAGMA, the industry's best GPU implementation, which leaves room for further optimization. In the design and implementation of the fine-grained batch variable-size matrix multiplication kernel functions, the problem inputs are partitioned at a fine granularity for this application scenario, and the GPU memory hierarchy is studied to achieve better parallelism within the kernel. In the design and implementation of the batch variable-size matrix multiplication framework, this thesis proposes a fine-grained kernel-function optimization, a batch-order optimization, and an adjustment for extreme data inputs to improve the computational performance of the framework. In the application and optimization of HPCC DGEMM, this thesis optimizes the benchmark with task-based partitioning and fine-grained partitioning to improve performance on the multi-GPU systems common in current high-performance computing. This thesis also improves and optimizes the current batch matrix multiplication method to extend the support and applicability of the BLAS library for batch operations, in the hope of addressing new challenges now emerging in high-performance computing, machine learning, and scientific computing.
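The core idea behind variable-size batching can be illustrated with a small sketch (not the thesis's GPU implementation; function names and the use of NumPy's stacked matmul as a stand-in for a fixed-size batched GPU kernel are illustrative assumptions). Problems are grouped by shape, so that each group can be dispatched as one uniform batched multiply, in the spirit of the batch-order optimization described above:

```python
import numpy as np
from collections import defaultdict

def vbatched_gemm(As, Bs):
    """Reference: variable-size batched GEMM as a plain loop, C[i] = A[i] @ B[i]."""
    return [a @ b for a, b in zip(As, Bs)]

def grouped_vbatched_gemm(As, Bs):
    """Group problems by shape so each group maps to one fixed-size batched call.
    Here np.matmul over a stacked 3-D array stands in for a GPU batched kernel."""
    groups = defaultdict(list)
    for i, (a, b) in enumerate(zip(As, Bs)):
        groups[(a.shape, b.shape)].append(i)  # bucket by (A-shape, B-shape)
    out = [None] * len(As)
    for idx in groups.values():
        A = np.stack([As[i] for i in idx])  # (batch, m, k), uniform within group
        B = np.stack([Bs[i] for i in idx])  # (batch, k, n)
        C = A @ B                           # one batched multiply per group
        for j, i in enumerate(idx):
            out[i] = C[j]
    return out
```

On a GPU, each group would be handed to a fixed-size batched kernel (or a fine-grained kernel tuned to that tile size), so the variable-size workload is served by a small number of uniform launches instead of one launch per problem.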
Keywords/Search Tags:GPU, BLAS, batched GEMM, High Performance Computing