
Research On Key Technologies Of Memory Management And Communication Optimization For Deep Learning System

Posted on: 2021-12-19    Degree: Doctor    Type: Dissertation
Country: China    Candidate: B Liu
GTID: 1488306107456554    Subject: Computer system architecture
Abstract/Summary:
Deep learning is one of the most popular fields in artificial intelligence, outperforming traditional machine learning on complex real-world problems. Compared with shallow artificial neural networks, deep neural networks have far more hidden layers and neurons, and they generate enormous amounts of intermediate data of various types. These data are critical to the efficiency of training and inference tasks, but they also impose heavy computation, storage, and communication loads on GPU-based deep learning systems. Several issues in such systems remain to be addressed. First, because of the limited memory capacity of accelerator hardware (e.g., GPUs and FPGAs) and the low utilization of memory resources, it is difficult to complete large-scale iterative training tasks. Second, because the interconnection bandwidth of distributed accelerator clusters is limited, frequent gradient communication between cluster nodes becomes the performance bottleneck. Solving these problems mainly involves core techniques for memory management and sparse communication optimization of the intermediate training data (i.e., feature maps, weights, and their gradients). This research therefore focuses on runtime optimization of deep learning systems and develops the following strategies.

To address the memory issue in training, we first identify the memory-usage characteristics of deep and wide convolutional neural networks and demonstrate opportunities for memory reuse at both the intra-layer and inter-layer levels. We then present Layrub, a runtime GPU memory management strategy tailored to the feature-map data of convolutional neural networks. Layrub first orchestrates the execution of the backward phase of the training process; it then applies layer-centric memory reuse, guided by the memory characteristics of the network model, to reduce memory consumption and to organize the feature-map data with high memory utilization. By judiciously transferring and placing data between the accelerator and host memory, Layrub significantly reduces memory consumption during training, providing system-level support for the morphological design and further study of deep neural networks. Experiments show that, compared with the original Caffe, Layrub cuts memory usage by 58.2% on average and by up to 98.9%, at the moderate cost of 24.1% longer training time on average. The results also show that Layrub outperforms several popular deep learning systems that employ memory optimization strategies and can carry out extreme-scale training tasks.
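The layer-centric reuse idea can be made concrete with a small illustration. Layrub itself is implemented inside Caffe and orchestrates the backward pass directly; the sketch below is only a hedged PyTorch-style analogue of the underlying data movement, i.e., parking each layer's feature maps in host memory after the forward computation and staging them back to the GPU just before the corresponding backward computation. The hook API, tensor sizes, and model used here are illustrative assumptions, not taken from the dissertation.

```python
# A minimal, hypothetical sketch of feature-map offloading in the spirit of Layrub.
# Layrub is built into Caffe; this PyTorch analogue only illustrates the idea of
# keeping feature maps in host memory between the forward and backward phases.
import torch

def pack_to_host(t):
    # Forward phase: move each saved feature map to pinned host memory so its
    # GPU storage can be reused by later layers.
    return t.detach().to("cpu", non_blocking=True).pin_memory() if t.is_cuda else t

def unpack_to_device(t):
    # Backward phase: bring the feature map back to the GPU right before the
    # corresponding layer's gradient computation needs it.
    return t.to("cuda", non_blocking=True) if not t.is_cuda else t

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
    torch.nn.Linear(4096, 1000),
).cuda()
inputs = torch.randn(256, 4096, device="cuda")

# saved_tensors_hooks (PyTorch >= 1.10) intercepts every tensor the autograd
# graph keeps for the backward pass, which is where the feature maps live.
with torch.autograd.graph.saved_tensors_hooks(pack_to_host, unpack_to_device):
    loss = model(inputs).sum()
loss.backward()  # feature maps are staged back to the GPU layer by layer
```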
To tackle the communication bottleneck of distributed training, we propose GradSA, a staleness-compensated sparse gradient optimization mechanism that improves communication efficiency. First, gradients are sparsified at the layer level, in keeping with the structure of the network, to reduce communication overhead (a brief illustrative sketch of this step is given at the end of the abstract). Second, the historical accumulation of the approximated gradients is used to speed up convergence. An efficient encoding mechanism is then proposed to compress the sparse accumulated historical gradients. Experimental results show that the proposed gradient optimization algorithm achieves near-ideal throughput and speedup even in extreme environments with low network bandwidth. Extensive experiments further indicate that GradSA can reduce gradient size by 514× without performance degradation. Comparisons with competitors such as 8-bit quantization, TernGrad, and DGC further demonstrate the advantages of GradSA across a variety of neural network models and datasets.

To reduce the heavy consumption of multiple resources (i.e., computing, memory, and communication bandwidth) and to resolve the conflict between the different optimizations, we pursue collaborative optimization of both system memory and bandwidth and propose LaySA, a memory-efficient distributed sparse communication mechanism. First, to tackle the memory ballooning caused by sparse communication, the memory reuse strategy of Layrub is refined, and the data objects targeted by the memory optimization are augmented and redefined. Second, a mirror parameter update mechanism is proposed to resolve the contradiction between memory management and sparse communication optimization for weight gradients. The deep integration and collaborative execution of these two kinds of strategies fills the gap in multi-resource optimization for distributed GPU-based training systems. Experimental results show that the proposed collaborative optimization significantly alleviates the memory pressure on computing nodes and improves both the resource utilization and the throughput of distributed training systems. Compared with baseline systems that use either Layrub or GradSA alone, LaySA saves up to 80.5% of system memory, and the overall training time of the neural network models on a single GPU is reduced by about 12.25%. Furthermore, LaySA can greatly scale up the batch size used in distributed training, with overall throughput increasing by more than 150%, outperforming current training systems that apply memory optimization or communication optimization alone.

In summary, this research studies memory management and sparse communication optimization policies for the intermediate data of deep neural networks, aiming to guarantee the efficiency of the runtime training program. It also proposes to enhance and refine the intermediate data in light of both their characteristics and their volume, so as to preserve the convergence quality of training. Through this work, the efficiency and performance of large-scale deep learning can be exploited as fully as possible under restricted hardware resources such as computing, memory, and bandwidth.
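The sketch referenced above outlines the layer-level sparsification with accumulated residuals that GradSA builds on. It is a minimal sketch under stated assumptions: the function names, the fixed top-k sparsity ratio, and the residual buffer are illustrative choices rather than the dissertation's implementation, and GradSA's staleness compensation and encoding scheme are omitted.

```python
# A minimal, hypothetical sketch of layer-wise top-k gradient sparsification
# with local residual accumulation, in the spirit of (but not identical to) GradSA.
import torch

residuals = {}  # per-layer gradient mass that has not been communicated yet

def sparsify_layer_grad(name, grad, ratio=0.01):
    """Return (indices, values) to communicate; keep the rest as a local residual."""
    acc = residuals.get(name, torch.zeros_like(grad)) + grad
    flat = acc.flatten()
    k = max(1, int(flat.numel() * ratio))
    # Select the k largest-magnitude entries of the accumulated gradient.
    _, idx = torch.topk(flat.abs(), k)
    values = flat[idx]
    # Everything that is not sent stays in the residual buffer, so it is
    # compensated for (with some staleness) in later communication rounds.
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[idx] = True
    residuals[name] = torch.where(mask, torch.zeros_like(flat), flat).view_as(acc)
    return idx, values

# Usage after loss.backward(): only (index, value) pairs per layer are exchanged.
# for name, p in model.named_parameters():
#     idx, vals = sparsify_layer_grad(name, p.grad)
#     exchange_sparse(name, idx, vals)   # hypothetical collective call
```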
Keywords/Search Tags:Deep Learning, Distributed Training, Intermediate Data, Memory Management, Sparse Communication Optimization