
Research On Parallel Optimization Of Transformer Model Based On The New Generation Of Sunway Many-core Processors

Posted on: 2022-12-28    Degree: Master    Type: Thesis
Country: China    Candidate: Y Q Jiang    Full Text: PDF
GTID: 2518306773997509    Subject: Automation Technology
Abstract/Summary:
Transformer is a deep learning architecture based on the multi-head attention mechanism. It is a landmark in the field of natural language processing, has been widely adopted in many large-scale industrial models, and in recent years has also become an important research direction in computer vision. The Transformer architecture lends itself to parallelization: parallel training can greatly reduce training time and improve computational efficiency.

In 2021, the new generation of domestic Sunway many-core processors, the SW26010 Pro, was officially launched, and its supporting software environment has completed the porting of PyTorch. However, the implementation and optimization of Transformer models on this platform are still incomplete. Specifically: first, some deep learning operators lack generality and cannot meet the requirements of Transformer models; second, some deep learning operators have low parallel efficiency and cannot fully exploit the computational power of the many-core processor; third, there is a lack of performance tests of typical Transformer operators and of training tests of Transformer-based deep learning models.

To address the parallel optimization of Transformer models on the new generation of Sunway many-core processors, this paper first studies the hardware architecture characteristics of the SW26010 Pro and designs parallel optimization algorithms for the basic operators in Transformer. It then designs parallel optimization algorithms for the deep learning operators in Transformer. Finally, it designs a large number of test scenarios to verify the final optimization results. The main contributions of this paper are as follows.

First, this paper investigates the hardware architecture characteristics of the SW26010 Pro and designs efficient parallel optimization algorithms for typical basic operators in Transformer, tailored to these hardware characteristics, which significantly improve the computational efficiency and memory-access bandwidth of the basic operators. The experimental results show that compute-intensive basic operators such as matrix multiplication reach a computational efficiency of over 90%, and memory-access-intensive basic operators such as element-wise operations reach speeds close to the peak memory bandwidth, effectively accelerating deep learning inference and training.

Second, this paper develops a deep learning operator library based on the SW26010 Pro, designs efficient parallel optimization algorithms for typical deep learning operators in Transformer, and assembles them into a complete Transformer acceleration library, Swpex. The experimental results show that the LayerNorm, Softmax, Linear, MHA, and other operators implemented in this paper achieve significant speedups over the original PyTorch operators.

Third, this paper investigates the training process of Transformer models on the Sunway heterogeneous many-core architecture and designs test methods for the basic operators, the deep learning operators, and the Transformer-based ViT model under various parameter combinations. The experimental results show that both the basic operators and the deep learning operators make good use of the performance of the SW26010 Pro, and that the ViT model with the Swpex acceleration library achieves good acceleration over the original version, up to 269 times in the single-node case.
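The abstract describes per-operator tests of LayerNorm, Softmax, Linear, and MHA against the stock PyTorch versions under various parameter combinations. The sketch below is a minimal illustration of such an operator micro-benchmark written in plain PyTorch; the tensor shapes, the bench helper, and the operator list are assumptions made for illustration, and the actual Swpex interface and the thesis's test harness are not shown here.

```python
# Illustrative operator micro-benchmark in plain PyTorch.
# The Swpex library's real API is not described in the abstract; this sketch
# only shows the kind of per-operator timing comparison the thesis reports.
import time
import torch

def bench(fn, *args, warmup=3, iters=20):
    """Time an operator call, returning mean seconds per iteration."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

# Assumed test shapes, chosen only for demonstration.
batch, seq_len, embed_dim, num_heads = 8, 128, 512, 8
x = torch.randn(batch, seq_len, embed_dim)

ops = {
    "LayerNorm": (torch.nn.LayerNorm(embed_dim), (x,)),
    "Softmax":   (torch.nn.Softmax(dim=-1), (x,)),
    "Linear":    (torch.nn.Linear(embed_dim, embed_dim), (x,)),
    "MHA":       (torch.nn.MultiheadAttention(embed_dim, num_heads,
                                              batch_first=True), (x, x, x)),
}

with torch.no_grad():
    for name, (module, args) in ops.items():
        print(f"{name:10s} {bench(module, *args) * 1e3:.3f} ms/iter")
```

Comparing these timings for the stock operators against an accelerated implementation of the same operators is the basic shape of the speedup measurements summarized above.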
Keywords/Search Tags: Transformer, Parallel Computing, New Generation of Sunway Many-core Processors, Deep Learning