Performance Analysis And Optimization Of General Matrix-vector Operations Based On Shenwei 1621

Posted on: 2022-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: J Deng
Full Text: PDF
GTID: 2518306755960859
Subject: Control Engineering
Abstract/Summary:
High-performance computing (HPC) is closely tied to computing-power services and artificial intelligence, which gives it an irreplaceable position in frontier fields. Sunway TaihuLight, the representative domestic high-performance computer, was also the first high-performance computer built entirely on Chinese chips to achieve the world's highest computing capacity, reflecting China's comprehensive strength in cutting-edge technology. The Basic Linear Algebra Subprograms (BLAS) library is one of the most widely used libraries on high-performance computers, and the general matrix-vector multiplication (GEMV) function is the foundation of the Level 2 BLAS routines, so optimizing the BLAS library is particularly important. To make full use of the computing advantages of a high-performance BLAS library on the Sunway 1621 platform, this thesis analyzes the performance of general matrix-vector operations on Sunway 1621 and studies and implements related optimization methods. The main work includes the following aspects:

1. Loop interchange is applied to the GEMV function: changing the order of loop iterations reduces the array access stride and improves data locality. After loop interchange, the average performance of the GEMV function is 3.4 times that of the original. For small-scale matrix multiplication, allocating working memory on the stack and adding a branch that checks the stride of the y vector improve average performance by 53.3%.

2. To exploit multi-threaded computation on multi-core processors, this thesis proposes a blocking algorithm based on loop interchange. The tile size is determined by the cache-line size, balancing the benefits of vectorization against the cost of boundary handling while shortening the data reuse distance. The average performance of the non-transposed and transposed functions improves by 8.9% and 11.6%, respectively. To simplify the implementation of the algorithm, a computation-reordering optimization is proposed and the best computation order is selected; with this order, the average performance of the non-transposed and transposed functions improves by 15.9% and 14.8%, respectively.

3. To maximize the vector computing capability of the Sunway 1621 processor, an instruction-level optimization method combining Single Instruction, Multiple Data (SIMD) vectorization with instruction reordering is proposed. Experimental results show that instruction scheduling improves the average performance of the non-transposed function by 13.6%, reaching 2.17 times that of GotoBLAS. For the transposed function, instruction scheduling improves average performance by 12.7%, reaching 1.8 times that of the GotoBLAS version.
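
The loop interchange named in the first contribution can be illustrated with a minimal C sketch. It is not the thesis's implementation: the function names `gemv_n_rowwise` and `gemv_n_colwise` are hypothetical, the matrix is assumed to be stored column-major (the usual BLAS convention) with leading dimension `lda`, and the operation is reduced to y += A*x with unit-stride vectors. Both functions compute the same result; only the loop order, and therefore the memory access stride of the inner loop, differs.

```c
#include <stddef.h>

/* Assumption: A is m x n, column-major, element (i, j) at a[i + j * lda]. */

/* Dot-product order: the inner loop walks along a row of A,
 * so consecutive accesses to A are lda elements apart. */
void gemv_n_rowwise(size_t m, size_t n, const double *a, size_t lda,
                    const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += a[i + j * lda] * x[j];
        y[i] += acc;
    }
}

/* Interchanged order: the inner loop walks down a column of A,
 * so consecutive accesses to A and y are contiguous (stride 1). */
void gemv_n_colwise(size_t m, size_t n, const double *a, size_t lda,
                    const double *x, double *y)
{
    for (size_t j = 0; j < n; j++) {
        const double xj = x[j];          /* reused for the whole column */
        const double *col = a + j * lda; /* start of column j */
        for (size_t i = 0; i < m; i++)
            y[i] += col[i] * xj;
    }
}
```

With column-major storage, the interchanged version reads each cache line of A in full before moving on, which is the locality effect the abstract attributes to loop interchange. Its unit-stride inner loop is also the form that tiling and SIMD vectorization typically build on.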
Keywords/Search Tags:Sunway 1621, GEMV, SIMD, Performance optimization