Performance Analysis And Optimization Of General Matrix-vector Operations Based On Shenwei 1621

Posted on:2022-11-16

Degree:Master

Type:Thesis

Country:China

Candidate:J Deng

GTID:2518306755960859

Subject:Control Engineering

Abstract/Summary:

The current state of high-performance computing (HPC) shows that it is closely tied to computing power services and artificial intelligence, which gives HPC an irreplaceable position in frontier fields. Sunway TaihuLight, the representative of domestic high-performance computers, is also the first high-performance computer to fully use Chinese chips and achieve the world's highest computing capacity, reflecting China's comprehensive strength in cutting-edge technology. The Basic Linear Algebra Subprograms (BLAS) library is one of the most widely used libraries on high-performance computers, and the General Matrix-Vector multiplication (GEMV) function is the foundation of the entire Level 2 BLAS function set, so optimizing the BLAS library is particularly important. To make full use of the computing advantages of a high-performance BLAS library on the Sunway 1621 platform, this thesis analyzes the performance of the general matrix-vector operation on Sunway 1621 and researches and implements related optimization methods. The main work includes the following aspects:

1. Loop interchange is applied to the GEMV function. Changing the order of loop iterations reduces the array access stride and improves data locality. After loop interchange, the average performance of the GEMV function is 3.4 times that of the original. For small-scale matrix operations, allocating memory on the stack and adding a branch that tests the stride of the y vector improves average performance by 53.3%.

2. To exploit multi-threaded computation on the multi-core processor, this thesis proposes a blocking algorithm based on loop interchange. The tile size is determined by the cache line size, balancing the benefits of vectorization against the cost of boundary processing while shortening the data reuse distance. The average performance of the non-transpose and transpose functions improves by 8.9% and 11.6% respectively. To simplify the implementation of the algorithm, a computation-reordering optimization is proposed and the optimal computation order is selected. With this order, the average performance of the non-transpose and transpose functions improves by 15.9% and 14.8% respectively.

3. To maximize the vector computing capability of the Sunway 1621 processor, an instruction-level optimization method using Single Instruction Multiple Data (SIMD) and instruction rearrangement is proposed. Experimental results show that instruction scheduling improves the average performance of the non-transpose function by 13.6%, reaching 2.17 times that of GotoBLAS. Instruction scheduling improves the average performance of the transpose function by 12.7%, reaching 1.8 times that of the GotoBLAS version.
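
As a rough illustration of the loop interchange and blocking ideas described in the abstract, the following C sketch computes y += A*x in three ways. It is not the thesis's implementation: the function names, the column-major layout, the use of double precision, and the `tile` parameter are all assumptions made for this example, and it contains no Sunway-specific SIMD or threading.

```c
#include <assert.h>
#include <stddef.h>

/* All three routines compute y += A * x, where A is m-by-n and stored
 * column-major with leading dimension lda (an assumed layout). */

/* Baseline: the inner loop walks along a row of A, so consecutive
 * accesses to A are lda elements apart (large stride). */
void gemv_n_rowwise(size_t m, size_t n, const double *a, size_t lda,
                    const double *x, double *y)
{
    assert(a && x && y && lda >= m);
    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < n; ++j)
            acc += a[i + j * lda] * x[j];
        y[i] += acc;
    }
}

/* Loop interchange: the inner loop now walks down a column of A,
 * so accesses to A and y are unit-stride. */
void gemv_n_interchanged(size_t m, size_t n, const double *a, size_t lda,
                         const double *x, double *y)
{
    assert(a && x && y && lda >= m);
    for (size_t j = 0; j < n; ++j) {
        const double xj = x[j];
        const double *col = a + j * lda;
        for (size_t i = 0; i < m; ++i)
            y[i] += col[i] * xj;
    }
}

/* Blocking on top of the interchanged loops: rows are processed in
 * tiles so the active slice of y is reused across all columns before
 * moving on. The tile size is a hypothetical tuning parameter; the
 * last tile may be shorter (boundary handling). */
void gemv_n_tiled(size_t m, size_t n, const double *a, size_t lda,
                  const double *x, double *y, size_t tile)
{
    assert(a && x && y && lda >= m && tile > 0);
    for (size_t i0 = 0; i0 < m; i0 += tile) {
        const size_t i1 = (i0 + tile < m) ? i0 + tile : m;
        for (size_t j = 0; j < n; ++j) {
            const double xj = x[j];
            const double *col = a + j * lda;
            for (size_t i = i0; i < i1; ++i)
                y[i] += col[i] * xj;
        }
    }
}
```

A tile that is a multiple of the number of elements per cache line would be a natural starting point for tuning, but the best value depends on the target cache hierarchy and would need to be measured.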

Keywords/Search Tags:

Sunway 1621, GEMV, SIMD, Performance optimization

Related items

1	Based On The Shenwei 1621 Platform BLAS Primary And Secondary Function Optimization Research
2	Implementation And Optimization Of Mixed-radix FFT On Sunway Platform
3	Research On Loop Transformation Optimization Based On Domestic Shenwei Compiler
4	Research On Parallel Optimization Of BLAS Based On The New Generation Of Sunway Many-core Processor
5	Implementation And Optimization Of Convolution Neural Network Library On Sunway Platform
6	Research On Directive-based Parallel Language For Sunway Taihulight Supercomputer And Design Of The Compiling Optimization
7	Parallel Implementation And Performance Optimization For Refactoring GROMACS On The Sunway Many-core Architecture
8	The Research Of High Performance Algorithm For GROMACS Based On Sunway TaihuLight
9	Research Of Parallel Evolutionary Algorithm Based On Sunway Manycore Architecture
10	Research On Performance Tuning Of Matrix Multiplication Based On GPU