
Parallel Deep Learning Training System On Sunway TaihuLight

Posted on: 2020-08-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J R Fang
Full Text: PDF
GTID: 1368330626464469
Subject: Computer Science and Technology
Abstract/Summary:
Deep learning is currently the most successful artificial intelligence technology, and it is expected to lead humanity into an intelligent era. Its enormous computing demand is driving the combination of deep learning and supercomputers. Since the United States banned the sale of high-performance computing chips to China, the planned next-generation Chinese supercomputers will all be built on domestic many-core processors. However, there has been no research on deep learning software systems designed for these domestic supercomputers, and implementing such a system faces several challenges. First, there is no systematic optimization guidance tailored to the innovative hardware features of the domestic processor. Second, it is difficult to map the complex computation patterns of deep learning onto the new architecture. Third, the compilation tools and system libraries on the domestic supercomputers are difficult to leverage. Fourth, innovative optimization methods are required to overcome bottlenecks in hardware modules such as the network and I/O when scaling to large numbers of nodes.

To address these challenges, this thesis proposes a systematic methodology for building a deep learning system on Sunway TaihuLight, the most powerful Chinese supercomputer, which adopts a domestic heterogeneous many-core processor called SW26010. To develop the system efficiently, a modular software design is adopted that decomposes the system into functional modules, including GEMM, deep learning operators, an automatic code tuner, and network communication. The main contributions of this thesis are as follows:

First, a performance analysis model and a tensorization programming model customized for the innovative features of the SW26010 architecture are proposed. Guided by the performance analysis model, the tensorization programming model expresses the optimal algorithm workflow as a combination of tensor-oriented memory-access and computation primitives, which bridges the gap between hardware usage and algorithm design. To implement the key GEMM primitives, a matrix multiplication algorithm based on the register communication feature of the SW26010 many-core processor is designed.

Second, applying the performance analysis model and the tensorization programming model, a set of deep learning operators is optimized on SW26010, including convolution, fully-connected, and LSTM operators. In addition, an end-to-end automatic code tuning method is proposed to reduce the engineering burden. As a result, the computational efficiency of the tuned operators on SW26010 exceeds that of cuDNN v7.5 on GPUs.

Third, this thesis studies the key techniques for scaling deep learning training on supercomputers and breaks through the scalability bottlenecks at both the system and the algorithm level. At the system level, a parallel training framework is implemented on Sunway TaihuLight; after optimizing modules such as network communication, I/O, and memory management, it is able to train popular deep learning models at the scale of 1,024 nodes. At the algorithm level, to reduce the volume of communicated data, a data-parallel method based on residual gradient compression is designed, which improves the scalability of the system without losing accuracy. This work not only significantly speeds up deep learning training tasks that were difficult to scale even on the latest GPU supercomputers, but also provides a reference for deep learning system software design on the next generation of domestic supercomputers.
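The thesis's GEMM algorithm relies on SW26010's register communication, whose details are not reproduced in this abstract. As a rough illustration of the underlying blocking idea only, the generic NumPy sketch below decomposes a large matrix multiplication into tiles that would map to per-core fast memory on a many-core chip; the tile size and the name tiled_gemm are illustrative assumptions, not the thesis's actual implementation.

import numpy as np

def tiled_gemm(A, B, tile=64):
    # Generic blocked matrix multiply: each (tile x tile) block of C is
    # accumulated from tile-sized panels of A and B. On a many-core
    # processor these blocks would reside in per-core fast memory;
    # here plain NumPy stands in for the on-chip computation.
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C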
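The residual gradient compression mentioned above follows the general idea of transmitting only the largest gradient entries each step while accumulating the untransmitted remainder locally, so that no gradient information is permanently discarded. The Python sketch below is a minimal, generic illustration of this top-k-with-residual scheme; the function and parameter names (compress_with_residual, ratio) are assumptions for illustration and do not reflect the thesis's actual implementation on Sunway TaihuLight.

import numpy as np

def compress_with_residual(grad, residual, ratio=0.001):
    # Add back the error carried over from previous steps, so entries
    # skipped earlier are eventually transmitted.
    acc = grad + residual
    k = max(1, int(acc.size * ratio))
    idx = np.argpartition(np.abs(acc).ravel(), -k)[-k:]   # k largest magnitudes
    values = acc.ravel()[idx]
    new_residual = acc.copy()
    new_residual.ravel()[idx] = 0.0                        # sent entries leave the residual
    return idx, values, new_residual

# Usage: each worker keeps its own residual buffer between iterations.
grad = np.random.randn(1 << 20).astype(np.float32)         # a worker's local gradient
residual = np.zeros_like(grad)                              # persists across iterations
idx, values, residual = compress_with_residual(grad, residual)
# Only (idx, values) are exchanged between nodes instead of the full gradient.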
Keywords/Search Tags: Sunway TaihuLight Supercomputer, Deep Learning System, Automatic Optimization, Parallel Computing, High-Performance Computing