
Design And Implementation Of Batch Matrix Multiplication Framework For High Performance Computing

Posted on: 2023-10-26    Degree: Master    Type: Thesis
Country: China    Candidate: R M Wang    Full Text: PDF
GTID: 2558306830491304    Subject: Computer Science and Technology
Abstract/Summary:
The Basic Linear Algebra Subprograms (BLAS) is an interface standard for a range of basic linear algebra functions that has long been used across scientific computing and industry, and it plays a vital role in modern scientific fields and industrial development. Classical BLAS implementations tend to perform well for large problems (large matrix or vector operations). However, in scenarios where each problem is small but the number of problems is large, underutilization of hardware resources can lead to significantly poor performance. The Batch Basic Linear Algebra Subprograms (Batch BLAS) concept and its initial implementations have therefore been proposed in recent years to address this emerging trend: multiple sub-problems are processed in parallel as a batch, resolving the performance dilemma that classical BLAS faces in such scenarios.

For batch matrix multiplication, current GPU-based BLAS libraries (cuBLAS, rocBLAS) support only fixed-size, uniform problem inputs, not variable-size matrix computation, which greatly limits their application scenarios. The aim of this thesis is to design and implement a GPU-based framework for batch variable-size matrix multiplication for high-performance computing, in order to address the poor performance of existing libraries in such scenarios. The research covers four areas: the analysis and optimization of the MAGMA vbatched routines; the design and implementation of a fine-grained variable-size matrix multiplication kernel function; the design and implementation of a variable-size matrix multiplication framework; and the application and optimization of HPCC DGEMM.

This thesis analyzes the input scenarios of batch variable-size matrix multiplication and presents the advantages and drawbacks of MAGMA, the industry's best GPU implementation, which leaves room for further optimization. In the design and implementation of the fine-grained batch variable-size matrix multiplication kernel functions, the problem inputs are partitioned at a fine granularity for this application scenario, and the GPU memory hierarchy is studied to achieve better parallelism within the kernel. In the design and implementation of the batch variable-size matrix multiplication framework, this thesis proposes a fine-grained kernel-function optimization, a batch-order optimization, and an adjustment for extreme data inputs to improve the computational performance of the framework. In the application and optimization of HPCC DGEMM, this thesis optimizes the benchmark with task-based partitioning and fine-grained partitioning to improve performance on the multi-GPU systems common in current high-performance computing. This thesis also improves and optimizes the current batch matrix multiplication method to extend the support and applicability of the BLAS library for batch operations, in the hope of addressing new challenges now emerging in high-performance computing, machine learning, and scientific computing.
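The core idea behind variable-size batching can be illustrated with a small sketch (not the thesis's GPU implementation; function names and the use of NumPy's stacked matmul as a stand-in for a fixed-size batched GPU kernel are illustrative assumptions). Problems are grouped by shape, so that each group can be dispatched as one uniform batched multiply, in the spirit of the batch-order optimization described above:

```python
import numpy as np
from collections import defaultdict

def vbatched_gemm(As, Bs):
    """Reference: variable-size batched GEMM as a plain loop, C[i] = A[i] @ B[i]."""
    return [a @ b for a, b in zip(As, Bs)]

def grouped_vbatched_gemm(As, Bs):
    """Group problems by shape so each group maps to one fixed-size batched call.
    Here np.matmul over a stacked 3-D array stands in for a GPU batched kernel."""
    groups = defaultdict(list)
    for i, (a, b) in enumerate(zip(As, Bs)):
        groups[(a.shape, b.shape)].append(i)  # bucket by (A-shape, B-shape)
    out = [None] * len(As)
    for idx in groups.values():
        A = np.stack([As[i] for i in idx])  # (batch, m, k), uniform within group
        B = np.stack([Bs[i] for i in idx])  # (batch, k, n)
        C = A @ B                           # one batched multiply per group
        for j, i in enumerate(idx):
            out[i] = C[j]
    return out
```

On a GPU, each group would be handed to a fixed-size batched kernel (or a fine-grained kernel tuned to that tile size), so the variable-size workload is served by a small number of uniform launches instead of one launch per problem.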
Keywords/Search Tags:GPU, BLAS, batched GEMM, High Performance Computing