
Optimizations For Data Path In Parallel And Distributed Neural Network Training

Posted on: 2022-04-22    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y H Bai    Full Text: PDF
GTID: 1488306323963659    Subject: Computer system architecture

Abstract/Summary:
In the big-data, big-compute era, Deep Neural Networks (DNNs) have achieved breakthroughs on both structured and unstructured data. DNN models must be trained on large datasets before deployment, and training is compute-intensive and time-consuming, so parallel and distributed execution has quickly become an important strategy for accelerating DNN training. However, gradient synchronization and data loading become the bottlenecks of data-parallel DNN training and parallel Graph Neural Network (GNN) training, respectively, leading to low parallel efficiency. Gradient compression algorithms can greatly reduce gradient size without affecting model accuracy, which makes it possible to accelerate data-parallel DNN training; yet because their design ignores the problems and challenges raised by the underlying system, the actual speedup delivered by gradient compression in real systems is limited. Strategies such as caching and graph partitioning can reduce the data loading cost of GNN training, but since the cache hit ratio drops as the graph grows, the data loading stage still restricts GNN training. To address these parallel and distributed neural network training problems, this dissertation re-examines the architecture and workflow of the DNN training system. First, it proposes HiPress, a high-performance, compression-aware framework for data-parallel DNN training, comprising a compression-aware gradient synchronization strategy, a mechanism that decides whether each gradient is compressed and how it is partitioned, and a toolkit for convenient development and deployment of gradient compression algorithms. Second, it proposes an efficient data loading strategy for fast sampling-based GNN training on large graphs. The research contents of this dissertation are as follows:

(1) Compression-Aware Gradient Synchronization Strategy

The operations introduced by a gradient compression algorithm have non-trivial costs: they not only delay the computation of the neural network model but also affect the gradient synchronization phase of the data-parallel distributed training system. Concretely, gradient compression conflicts with traditional network optimizations, and the compression and decompression operations along the synchronization path accumulate as the number of training nodes increases. To address this problem, we reconsider the system architecture enabled by gradient compression and propose CaSync, a compression-aware gradient synchronization strategy. CaSync first splits the traditionally tightly coupled computation and transmission, classifying the tasks of the training process into model computation, compression-related computation, and network transmission; this decouples task scheduling from any specific transmission architecture and creates more opportunities for task parallelism and scheduling. Second, it hides compression-related computation behind other kinds of operations through a pipeline mechanism. Finally, it accelerates the resulting fine-grained tasks with a bulk synchronization mechanism, reorganizing them into a specific transmission structure according to the user's settings; a minimal sketch of this pipelined, decoupled synchronization is given below. Experiments show that when training VGG19 and Bert-base on 16 servers with one GPU per server, CaSync reduces gradient synchronization time by 63.5% and 59.2%, respectively.
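The following sketch illustrates the CaSync-style decoupling idea under stated assumptions; the worker structure, queue names, and placeholder functions are hypothetical and are not the actual HiPress/CaSync API. It only shows how compression-related computation and network transmission, once separated into independent task types, can overlap with each other and with model computation.

# Hypothetical sketch of CaSync-style decoupling (not the actual HiPress API):
# model computation, compression-related computation, and network transmission
# are handled by independent workers, so compressing one gradient overlaps
# with transmitting another.
import queue
import threading

def compress(grad):
    # Placeholder for a GPU compression kernel (e.g. sparsification or quantization).
    return grad  # assumed to return a smaller encoded payload

def transmit(payload):
    # Placeholder for the network transfer of the chosen topology (PS or ring).
    pass

compress_q = queue.Queue()
send_q = queue.Ueue() if False else queue.Queue()  # plain FIFO hand-off queues

def compression_worker():
    while True:
        grad = compress_q.get()
        if grad is None:
            send_q.put(None)
            break
        send_q.put(compress(grad))  # hand off as soon as a gradient is encoded

def transmission_worker():
    while True:
        payload = send_q.get()
        if payload is None:
            break
        transmit(payload)

c = threading.Thread(target=compression_worker)
t = threading.Thread(target=transmission_worker)
c.start()
t.start()

# The backward pass enqueues gradients as they become available; compression
# and transmission of earlier gradients proceed while later layers still compute.
for grad in ["grad_layer3", "grad_layer2", "grad_layer1"]:
    compress_q.put(grad)
compress_q.put(None)  # end of this training iteration
c.join()
t.join()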
(2) Selective Compression and Partitioning Mechanism

The gradients of a deep neural network vary widely in size: large gradients exceed several hundred megabytes, while small ones are only a few bytes. Compressing small gradients no longer reduces network transmission latency; it instead adds compression and decompression overhead and delays gradient synchronization, a phenomenon we call over-compression. In addition, large gradients need to be partitioned into several parts to keep the load balanced across training nodes. To decide whether each gradient should be compressed and how it should be partitioned, this dissertation proposes SeCoPa, a selective compression and partitioning strategy driven by an offline cost-model analysis. Given a specific neural network model, hardware platform, compression algorithm, and cluster size, SeCoPa first finds the number of gradient partitions that minimizes synchronization time with and without compression, respectively, then compares the two costs to decide whether compression is worthwhile, and generates a compression/partitioning plan for each gradient. The cost model unifies the two popular communication topologies (PS and Ring-allreduce), and the offline-generated plans guide the training process of HiPress at runtime. Experiments show that SeCoPa accurately balances the benefits and costs of compression, with less than 5% error between predicted and measured results. When training VGG19 and Bert-base on 16 servers with one GPU each, SeCoPa combined with CaSync reduces gradient synchronization time by 14.9% and 13.6%, respectively.

(3) A Gradient Compression Development Toolkit

The efficiency of compression-related computation directly affects end-to-end training speed, and developing and deploying a gradient compression algorithm requires design and implementation across all layers of the deep learning system stack, which poses difficulties for users unfamiliar with the underlying system architecture. To address this problem, we propose CompLL, an agile development toolkit for gradient compression algorithms. CompLL first abstracts 7 common operators from popular gradient compression algorithms and implements them with careful optimization on the GPU, so that developers can build a compression algorithm by composing these operators (illustrated below). To further reduce development and deployment effort, CompLL provides a domain-specific language (DSL) for describing a compression algorithm abstractly, along with a code generator that translates the DSL description into a GPU-oriented C++ implementation and registers it into popular neural network computing systems through the provided wrappers. Experiments show that with CompLL, users can deploy the four gradient compression algorithms used in this dissertation in only a few dozen lines of code each, and the generated code achieves a 12x speedup over an open-source GPU-oriented implementation when compressing a 256MB gradient.
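To make the operator-composition idea concrete, the sketch below expresses a top-k-style sparsifying compressor as a pipeline of generic operators. The operator names and signatures are hypothetical illustrations in Python, not CompLL's DSL or its 7 built-in GPU operators, which generate C++ code in the real toolkit.

# Illustrative composition of generic compression operators in the spirit of CompLL.
import numpy as np

def op_sample_threshold(grad: np.ndarray, ratio: float) -> float:
    # Estimate a magnitude threshold that keeps roughly `ratio` of the entries.
    return float(np.quantile(np.abs(grad), 1.0 - ratio))

def op_filter(grad: np.ndarray, threshold: float):
    # Keep only entries whose magnitude reaches the threshold (sparsification).
    idx = np.nonzero(np.abs(grad) >= threshold)[0].astype(np.int32)
    return idx, grad[idx].astype(np.float32)

def op_concat(idx: np.ndarray, values: np.ndarray) -> bytes:
    # Pack indices and values into one contiguous payload for transmission.
    return idx.tobytes() + values.tobytes()

def topk_like_compress(grad: np.ndarray, ratio: float = 0.01) -> bytes:
    # A top-k-style compression algorithm written as a short operator pipeline.
    threshold = op_sample_threshold(grad, ratio)
    idx, values = op_filter(grad, threshold)
    return op_concat(idx, values)

payload = topk_like_compress(np.random.randn(1 << 20).astype(np.float32))
print(len(payload), "bytes after compression")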
(4) Optimization Strategy of Data Loading for GNN Training

When training GNNs with sampling on large-scale graphs, the data loading stage becomes the bottleneck that limits training speed. Existing work leverages static caching and other strategies to accelerate data loading, but because cache efficiency decreases as the graph grows, the data loading stage still restricts GNN training. To address this problem, this dissertation proposes GLoader, a data loading optimization strategy for sampling-based GNN training on large-scale graphs. Observing that the hardware resources used by the data loading phase and the model computation phase do not conflict with each other, GLoader adopts a pipeline mechanism on top of existing work to further hide the data loading cost, as sketched at the end of this abstract. GLoader has been validated on large-scale graphs and with different sampling algorithms. Experiments show that on a single server with 4 GPUs, across seven graph datasets and two sampling algorithms used to train two GNN models (GCN and GraphSAGE), GLoader's pipeline strategy combined with caching completely hides the data loading stage behind the computation stage; computing resources are fully utilized, and the scalability of data-parallel training becomes close to linear.

To hide the complexity of the underlying system and provide a user-friendly API, we organize CaSync, SeCoPa, and CompLL into HiPress, a data-parallel distributed DNN training framework implemented on top of the popular computing systems MXNet, TensorFlow, and PyTorch. Experimental results show that when training four deep neural network models on 16 servers with up to two GPUs per server, HiPress achieves 1.2-10.3× and 1.4-15.4× speedups over open-source frameworks without and with compression, respectively. HiPress raises the scaling factor of the four DNN models to as high as 0.95, which is very close to linear scaling. GLoader is implemented on top of DGL and PyTorch.
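The sketch below shows the GLoader-style overlap between data loading and model computation as a producer/consumer pipeline. The sampler, trainer, and queue here are generic placeholders, not GLoader's or DGL's actual API; it only demonstrates how prefetching the next minibatch hides loading time behind GPU computation.

# Minimal sketch of pipelined data loading for sampling-based GNN training.
import queue
import threading

def sample_and_load(batch_id):
    # Placeholder: sample a subgraph and gather its node features (CPU / cache).
    return f"minibatch-{batch_id}"

def train_step(minibatch):
    # Placeholder: forward/backward pass on the GPU for one sampled minibatch.
    pass

NUM_BATCHES = 8
prefetch_q = queue.Queue(maxsize=2)  # bounded buffer between the two stages

def loader_thread():
    # Data loading runs ahead of training, using CPU and PCIe resources that the
    # GPU-bound computation stage does not occupy.
    for batch_id in range(NUM_BATCHES):
        prefetch_q.put(sample_and_load(batch_id))
    prefetch_q.put(None)  # end-of-epoch marker

threading.Thread(target=loader_thread, daemon=True).start()

while True:
    minibatch = prefetch_q.get()
    if minibatch is None:
        break
    train_step(minibatch)  # computation overlaps with loading of the next batch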
Keywords/Search Tags:Neural Network Training, Distributed Training Framework, Data Parallel, Gradient Compression, Large Graph, Graph Neural Network, Pipeline