
Communication Optimization Technique For Distributed Synchronous Data Parallel Training

Posted on: 2022-12-20    Degree: Master    Type: Thesis
Country: China    Candidate: N F Bi    Full Text: PDF
GTID: 2518306776493524    Subject: Automation Technology
Abstract/Summary:
In recent years, deep learning techniques have developed rapidly. In particular, driven by large-scale datasets, distributed deep learning systems have been widely adopted in both academia and industry. These systems commonly employ synchronous data parallelism to train models. Synchronous stochastic gradient descent (SSGD) is the most widely used distributed synchronous data parallel training algorithm, and it requires communication over the network in every iteration. However, this communication overhead is expensive in distributed environments with limited bandwidth. A straightforward way to reduce the communication overhead is to increase the communication interval, i.e., to communicate only once every several iterations instead of in every iteration. However, increasing the communication interval usually slows the convergence of the model, so the training algorithm needs more epochs to reach the target accuracy; that is, the statistical efficiency of the training algorithm decreases. In addition, the choice of the communication interval directly determines the performance of the training algorithm, yet existing methods for choosing it introduce expensive additional overhead for collecting statistics or tuning hyper-parameters.

To address these problems in distributed synchronous data parallel training algorithms and in methods for choosing the communication interval, we focus on training algorithms with both low communication overhead and high statistical efficiency, and on communication interval selection methods with low additional overhead. The main contributions of this thesis are as follows.

We propose a training algorithm that combines a skipping strategy with a correction technique, ensuring both low communication overhead and high statistical efficiency. The algorithm keeps a small batch size by performing local updates in each training process, and it reduces the divergence among local models with the correction technique, thus ensuring high statistical efficiency. Meanwhile, it employs the skipping strategy to update the global model: instead of updating the global model in every iteration, it does so only once every several iterations. This reduces the communication frequency and thus ensures low communication overhead.
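The abstract does not spell out the concrete correction technique, so the following single-process NumPy sketch only illustrates the general scheme on a toy linear-regression task: local updates with a small batch size, a global model refreshed only every tau iterations (the skipping strategy), and an assumed drift-correction term that pulls each local model toward the last synchronized global model. The hyper-parameters (tau, lr, mu, batch) and the correction form are illustrative assumptions, not the thesis's exact algorithm.

```python
# Illustrative single-process simulation of local training with a skipping
# strategy and a simple correction term. The correction used here (a pull
# toward the last synchronized global model) is an assumption for
# illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data, split across simulated "workers".
n_workers, n_samples, dim = 4, 4096, 10
w_true = rng.normal(size=dim)
X = rng.normal(size=(n_samples, dim))
y = X @ w_true + 0.01 * rng.normal(size=n_samples)
shards = np.array_split(np.arange(n_samples), n_workers)

tau = 8      # communication interval: sync the global model every tau iterations
lr = 0.05    # learning rate for local updates
mu = 0.1     # strength of the (assumed) correction toward the global model
batch = 32   # small local batch size, kept constant

global_w = np.zeros(dim)
local_w = [global_w.copy() for _ in range(n_workers)]

for it in range(1, 201):
    for k in range(n_workers):
        idx = rng.choice(shards[k], size=batch, replace=False)
        grad = X[idx].T @ (X[idx] @ local_w[k] - y[idx]) / batch
        # Local update plus a correction term that limits divergence from the
        # last synchronized global model.
        local_w[k] -= lr * (grad + mu * (local_w[k] - global_w))
    if it % tau == 0:
        # Skipping strategy: the global model is updated only every tau
        # iterations, by averaging the local models (one communication round).
        global_w = np.mean(local_w, axis=0)
        local_w = [global_w.copy() for _ in range(n_workers)]

print("distance to w_true:", np.linalg.norm(global_w - w_true))
```

In this sketch the number of communication rounds drops by a factor of tau compared with synchronizing every iteration, while the correction term keeps the local models from drifting too far apart between synchronizations.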
We design an adaptive communication interval strategy based on the runtime statistics of the first iteration, which reduces the additional overhead of choosing the communication interval. The strategy initializes the communication interval to 1 and measures the time spent on communication and on computation in the first iteration. Based on these statistics, it adjusts the communication interval so that the time spent on communication and the time spent on computation in each epoch are close. After this one-time adjustment, the interval is applied to all subsequent iterations; no further statistics are collected and no further adjustments are made, which ensures low additional overhead.

We implement a prototype system that incorporates both the training algorithm combining the skipping strategy and the correction technique and the adaptive communication interval strategy. Both are implemented in TensorFlow, a distributed deep learning system. Based on this prototype system, we evaluate the efficiency of the training algorithm and of the communication interval strategy, and we elaborate on the design of the prototype system.

In summary, to address the high communication overhead of distributed synchronous data parallel training, we propose a communication-optimized training algorithm and a tuning strategy, and we implement a prototype system. Experimental results demonstrate the efficiency of these communication optimization techniques. In particular, compared with the SSGD training algorithm, our training algorithm combining the skipping strategy and the correction technique reduces the overall training time by 88.9%. Compared with existing communication interval selection strategies, our adaptive communication interval strategy reduces the additional overhead by three orders of magnitude.
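A minimal sketch of the adaptive communication interval strategy described above, in plain Python. The functions train_step and sync_global_model are hypothetical placeholders for the local computation and the global synchronization, and the rule tau ≈ t_comm / t_comp is our reading of "making the per-epoch communication and computation times close"; it is not taken verbatim from the thesis.

```python
# Sketch: choose the communication interval once, from first-iteration timings.
import time


def choose_interval(t_comp: float, t_comm: float) -> int:
    # With interval tau, an epoch of N iterations spends about N * t_comp on
    # computation and (N / tau) * t_comm on communication; making these close
    # gives tau ≈ t_comm / t_comp (our reading of the rule above).
    return max(1, round(t_comm / t_comp))


def train(num_iterations, train_step, sync_global_model):
    interval = 1  # the interval is initialized to 1
    for it in range(1, num_iterations + 1):
        if it == 1:
            start = time.perf_counter()
            train_step()
            t_comp = time.perf_counter() - start

            start = time.perf_counter()
            sync_global_model()
            t_comm = time.perf_counter() - start

            # Adjust once after the first iteration; no statistics are
            # collected afterwards, keeping the additional overhead low.
            interval = choose_interval(t_comp, t_comm)
        else:
            train_step()
            if it % interval == 0:
                sync_global_model()


if __name__ == "__main__":
    # Dummy steps: ~10 ms of computation and ~80 ms of communication per
    # round, so the chosen interval should come out around 8.
    train(20, lambda: time.sleep(0.01), lambda: time.sleep(0.08))
```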
Keywords/Search Tags:Deep Learning System, Synchronous Training, Data Parallelism, Distributed Training, Communication Optimization