
Parallel Deep Learning Training System On Sunway TaihuLight

Posted on: 2020-08-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J R Fang
Full Text: PDF
GTID: 1368330626464469
Subject: Computer Science and Technology
Abstract/Summary:
Deep learning is currently the most successful artificial intelligence technology, and it is expected to lead humanity into an intelligent era. Its enormous computing demand is driving the combination of deep learning and supercomputers. Since the United States banned the sale of high-performance computing chips to China, the planned next-generation Chinese supercomputers will all be built on domestic many-core processors. However, there has been no research on deep learning software systems designed for these domestic supercomputers, and implementing such a system faces several challenges. First, there is no systematic optimization guidance tailored to the innovative hardware features of the domestic processor. Second, it is difficult to map the complex computation patterns of deep learning onto the new architecture. Third, the compilation tools and system libraries on the domestic supercomputers are difficult to leverage. Fourth, innovative optimization methods are required to overcome bottlenecks in hardware modules such as the network and I/O when scaling to large numbers of nodes.

To address these challenges, this thesis proposes a systematic methodology for building a deep learning system on Sunway TaihuLight, the most powerful Chinese supercomputer, which adopts a domestic heterogeneous many-core processor called SW26010. To develop the system efficiently, a modular software design is adopted that decomposes the system into functional modules, including GEMM, deep learning operators, an automatic code tuner, and network communication. The main contributions of this thesis are as follows:

First, a performance analysis model and a tensorization programming model customized for the innovative features of the SW26010 architecture are proposed. Guided by the performance analysis model, the tensorization programming model expresses the optimal algorithm workflow as a combination of tensor-oriented memory-access and computation primitives, which bridges the gap between hardware usage and algorithm design. To implement the key GEMM primitives, a matrix multiplication algorithm based on the register communication feature of the SW26010 many-core processor is designed.

Second, applying the performance analysis model and the tensorization programming model, a set of deep learning operators is optimized on SW26010, including convolution, fully-connected, and LSTM operators. In addition, an end-to-end automatic code tuning method is proposed to reduce the engineering burden. As a result, the computational efficiency of the tuned operators on SW26010 exceeds that of cuDNN v7.5 on GPUs.

Third, this thesis studies the key techniques for scaling deep learning training on supercomputers and breaks through the scalability bottlenecks at both the system and the algorithm level. At the system level, a parallel training framework is implemented on Sunway TaihuLight; after optimizing modules such as network communication, I/O, and memory management, it is able to train popular deep learning models at the scale of 1,024 nodes. At the algorithm level, to reduce the volume of communicated data, a data-parallel method based on residual gradient compression is designed, which improves the scalability of the system without losing accuracy. This work not only significantly speeds up deep learning training tasks that were difficult to scale even on the latest GPU supercomputers, but also provides a reference for deep learning system software design on the next generation of domestic supercomputers.
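The thesis's GEMM algorithm relies on SW26010's register communication, whose details are not reproduced in this abstract. As a rough illustration of the underlying blocking idea only, the generic NumPy sketch below decomposes a large matrix multiplication into tiles that would map to per-core fast memory on a many-core chip; the tile size and the name tiled_gemm are illustrative assumptions, not the thesis's actual implementation.

import numpy as np

def tiled_gemm(A, B, tile=64):
    # Generic blocked matrix multiply: each (tile x tile) block of C is
    # accumulated from tile-sized panels of A and B. On a many-core
    # processor these blocks would reside in per-core fast memory;
    # here plain NumPy stands in for the on-chip computation.
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C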
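The residual gradient compression mentioned above follows the general idea of transmitting only the largest gradient entries each step while accumulating the untransmitted remainder locally, so that no gradient information is permanently discarded. The Python sketch below is a minimal, generic illustration of this top-k-with-residual scheme; the function and parameter names (compress_with_residual, ratio) are assumptions for illustration and do not reflect the thesis's actual implementation on Sunway TaihuLight.

import numpy as np

def compress_with_residual(grad, residual, ratio=0.001):
    # Add back the error carried over from previous steps, so entries
    # skipped earlier are eventually transmitted.
    acc = grad + residual
    k = max(1, int(acc.size * ratio))
    idx = np.argpartition(np.abs(acc).ravel(), -k)[-k:]   # k largest magnitudes
    values = acc.ravel()[idx]
    new_residual = acc.copy()
    new_residual.ravel()[idx] = 0.0                        # sent entries leave the residual
    return idx, values, new_residual

# Usage: each worker keeps its own residual buffer between iterations.
grad = np.random.randn(1 << 20).astype(np.float32)         # a worker's local gradient
residual = np.zeros_like(grad)                              # persists across iterations
idx, values, residual = compress_with_residual(grad, residual)
# Only (idx, values) are exchanged between nodes instead of the full gradient.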
Keywords/Search Tags: Sunway TaihuLight Supercomputer, Deep Learning System, Automatic Optimization, Parallel Computing, High-Performance Computing