
Implementation And Optimization Of Tensor Library Based On Sunway Domestic Supercomputer Platform

Posted on: 2022-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: J Gao
Full Text: PDF
GTID: 2518306731497894
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Sunway supercomputer system, independently developed in China, large and complex scientific research programs in many application fields have completed their computing tasks on this platform, providing strong support for the development of China's cutting-edge technology. However, compared with the high-performance tensor operator libraries available for mainstream hardware such as multi-core CPUs and GPUs, the tensor operator libraries developed for the domestically designed Shenwei 26010 pro many-core processor still fall short in both performance and operator coverage. They are not yet sufficient to form a mature tensor operator library ecosystem on the Sunway supercomputing platform that can support the efficient development of cutting-edge scientific applications. Exploring parallel optimization methods and mature library designs for tensor operators in different fields on the Sunway platform is therefore of great significance for the efficient development of large, complex scientific applications. The main contributions of this thesis are as follows:

1. Matrix multiplication. For traditional scientific computing applications, and addressing the low efficiency of the original matrix multiplication library on the new-generation Shenwei 26010 pro processor, this thesis builds and optimizes the level-3 matrix multiplication library SWBLAS. A matrix extension method preprocesses the input matrices so that matrices of arbitrary shape can be multiplied in parallel; a balanced blocking method partitions the multiplication task across all compute cores and guarantees load balance; a diagonal broadcast method schedules the core array's DMA accesses to main memory and uses RMA operations to broadcast rows and columns, eliminating memory access contention and redundant reads; a core-loop method vectorizes the innermost computation and fuses multiple instructions, improving parallel efficiency (a minimal sketch of the padding and blocking ideas appears after item 2). Experiments show that SWBLAS achieves a speedup of four orders of magnitude over the baseline running on the management processing element (MPE), and its computing performance reaches 91.4% of the hardware peak.

2. Deep neural network operators. For large-scale deep learning models, the existing operator library SWDNN can neither fully exploit the Shenwei processor nor satisfy the memory capacity and memory bandwidth requirements of large models. Building on the single-core-group SWBLAS tensor operator library, this thesis constructs the multi-core-group tensor acceleration library SWTensor. A three-level parallelization method and a multi-core-group operator task scheduling scheme satisfy the memory requirements of large-scale model training while improving parallel efficiency and overall computing performance; a three-level asynchronous pipelining mechanism and a memory access optimization that overlaps computation with data movement significantly relieve the memory bandwidth bottleneck of neural network operators (a double-buffering sketch of this overlap follows below). In experiments on the natural language processing model GPT-2, the typical compute-intensive and memory-access-intensive operators in SWTensor reach 90.4% of the theoretical peak single-precision floating-point performance and 88.7% of the theoretical peak memory bandwidth, respectively.
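The following is a minimal, illustrative C sketch of the two preprocessing ideas named in item 1; it is not the thesis code. The block size BLK, the core count NCORES, and the helper names extend_matrix and balanced_range are assumptions chosen for illustration only; the actual SWBLAS implementation operates on SW26010-pro core groups with DMA rather than plain host memory.

```c
/* Illustrative sketch (not the SWBLAS source): "matrix extension" pads an
 * M x K row-major matrix with zeros so both dimensions become multiples of
 * the tile edge, and "balanced blocking" assigns nearly equal numbers of
 * output tiles to each of NCORES compute cores. */
#include <stdlib.h>
#include <string.h>

#define BLK    32   /* assumed tile edge */
#define NCORES 64   /* one SW26010-pro core group has 64 compute cores */

/* Pad an m x k row-major matrix with zeros up to multiples of BLK. */
static double *extend_matrix(const double *a, int m, int k,
                             int *mp, int *kp) {
    *mp = (m + BLK - 1) / BLK * BLK;
    *kp = (k + BLK - 1) / BLK * BLK;
    double *out = calloc((size_t)(*mp) * (*kp), sizeof *out);
    for (int i = 0; i < m; ++i)
        memcpy(out + (size_t)i * (*kp), a + (size_t)i * k, k * sizeof *a);
    return out;
}

/* Give core `id` a contiguous range of the `nblocks` output tiles so that
 * no core holds more than one tile above the average (load balance). */
static void balanced_range(int nblocks, int id, int *first, int *count) {
    int base = nblocks / NCORES, rem = nblocks % NCORES;
    *count = base + (id < rem);
    *first = id * base + (id < rem ? id : rem);
}
```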
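The compute/memory-access overlap in item 2 can be pictured with a generic double-buffering loop. This is only a sketch of the ping-pong structure under the assumption that tile loads could be issued asynchronously; in SWTensor the loads are asynchronous DMA transfers on the compute cores and the pipeline has three stages, while the tile size and helper names here are hypothetical.

```c
/* Illustrative double-buffering sketch of compute/memory-access overlap:
 * while tile t is being processed, tile t+1 is fetched into the other
 * buffer. The "asynchronous" load is only simulated here so the ping-pong
 * structure stays visible. */
#include <stdio.h>

#define TILE 4

/* stand-in for an asynchronous DMA read of tile t into buf */
static void load_tile(double *buf, int t) {
    for (int i = 0; i < TILE; ++i) buf[i] = t + 0.1 * i;
}

/* stand-in for the operator kernel working on one tile */
static double compute_tile(const double *buf) {
    double s = 0.0;
    for (int i = 0; i < TILE; ++i) s += buf[i];
    return s;
}

int main(void) {
    double buf[2][TILE];
    int ntiles = 8;
    double total = 0.0;

    load_tile(buf[0], 0);                 /* prologue: prefetch tile 0 */
    for (int t = 0; t < ntiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        if (t + 1 < ntiles)
            load_tile(buf[nxt], t + 1);   /* would be issued asynchronously */
        total += compute_tile(buf[cur]);  /* overlaps with the next load */
    }
    printf("sum = %f\n", total);
    return 0;
}
```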
3. Tensor transposition. For tensor network applications in quantum computing simulation, traditional tensor transposition libraries are neither adapted to the Shenwei heterogeneous many-core architecture nor able to meet the performance requirements of high-dimensional tensor contraction. On top of the SWBLAS matrix multiplication library, this thesis proposes the tensor transposition library SWTT. A tensor transposition optimization algorithm uses blocked strided memory access and parallel local transposition to deal with scattered output elements and memory accesses that cannot be combined (a blocked-transpose sketch follows this item); a tensor contraction optimization algorithm, based on the traditional TTGT scheme, fuses the tensor transposition and the matrix multiplication into a single operator, eliminating the repeated write-back and read-in of the intermediate transposed matrix and its extra memory overhead, and further improves memory access utilization through memory reuse and Ring-RMA. Experiments show that SWTT achieves a speedup of three orders of magnitude over the MPE baseline, and its memory access performance reaches 85.5% of the hardware peak.
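As a rough illustration of the blocked local transposition in item 3, the sketch below tiles a plain 2-D out-of-place transpose so that both the reads and the writes stay inside a small tile. The tile edge TB and the function name are assumptions; the thesis performs the per-tile transposes in parallel on the compute cores and generalizes the idea to high-dimensional tensor permutations.

```c
/* Illustrative blocked transpose (not the SWTT source): a naive transpose
 * writes output elements with a large stride, so stores cannot be combined;
 * transposing tile by tile keeps each tile's reads and writes within a
 * block small enough to fit in fast local memory. */
#include <stddef.h>

#define TB 8   /* assumed tile edge */

/* in: n x m row-major; out: m x n row-major */
static void transpose_blocked(const double *in, double *out, int n, int m) {
    for (int ib = 0; ib < n; ib += TB)
        for (int jb = 0; jb < m; jb += TB)
            /* local transpose of one TB x TB tile (edges clipped) */
            for (int i = ib; i < ib + TB && i < n; ++i)
                for (int j = jb; j < jb + TB && j < m; ++j)
                    out[(size_t)j * n + i] = in[(size_t)i * m + j];
}
```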
Keywords/Search Tags: Tensor operator library, Matrix multiplication, Deep neural network, Tensor transpose, Shenwei multi-core processor