
Efficient Communication Strategies For Inter-datacenter Federated Learning

Posted on: 2022-12-23
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Zhang
Full Text: PDF
GTID: 2518306764962289
Subject: Automation Technology
Abstract/Summary:
Inter-datacenter federated learning enables machine learning across multiple geographically distributed datacenters by exchanging model parameters rather than raw data over low-speed WAN links, thereby protecting user data privacy. As a result, it is becoming an important paradigm for distributed machine learning. However, low-rate inter-datacenter links, system heterogeneity among datacenters, and limited intra-datacenter bandwidth bring new challenges to model synchronization in inter-datacenter federated learning.

(1) Low-rate inter-datacenter links. Inter-datacenter federated learning must synchronize model parameters over low-speed WAN links during training. Because of the enormous number of parameter synchronizations (up to tens of thousands) and the huge size of model parameters (up to hundreds of GBs), the communication overhead of inter-datacenter model synchronization is very high, which seriously degrades training efficiency.

(2) System heterogeneity among datacenters. The computing and communication capabilities of nodes in different datacenters vary widely, so the time to complete a single training iteration differs across datacenters. This easily causes the straggler problem and seriously slows down global model synchronization and training.

(3) Limited intra-datacenter bandwidth. Within a single datacenter, a large number of computing nodes must exchange large model parameters for local aggregation, generating very heavy traffic. Such traffic puts great pressure on the slow-growing, limited intra-datacenter bandwidth.

To address these challenges, this thesis studies communication-efficient strategies for inter-datacenter federated learning from the perspective of communication efficiency optimization. The main work is summarized as follows.

First, we propose an optimized hierarchical parameter server architecture to reduce the communication overhead between datacenters. In the traditional hierarchical parameter server architecture, the communication mode that launches one inter-datacenter iteration after every intra-datacenter iteration suffers from high inter-datacenter communication overhead and low training efficiency. We instead propose a communication mode that launches one inter-datacenter iteration only after multiple intra-datacenter iterations have completed. To support this functionality in a real distributed training framework, we design an indicator mechanism that determines whether the intra-datacenter parameter server should perform an intra-datacenter iteration or an inter-datacenter iteration. In addition, we propose to offload the optimizer from the parameter server to the computing nodes, which enables sparse compression of the downlink data from the intra-datacenter parameter server.
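To make this communication mode concrete, the following is a minimal sketch only, not the thesis's actual implementation: the parameter name tau_local and the random NumPy updates standing in for local training steps are illustrative assumptions. Each intra-datacenter parameter server finishes tau_local intra-datacenter aggregation rounds before a single inter-datacenter aggregation over the WAN.

import numpy as np

def train_hierarchical(workers_per_dc, tau_local, global_rounds, dim=10):
    # Hierarchical communication mode: tau_local intra-datacenter aggregation
    # rounds per inter-datacenter round (the indicator's role is played here
    # by a simple loop counter).
    rng = np.random.default_rng(0)
    global_model = np.zeros(dim)
    for _ in range(global_rounds):
        dc_models = []
        for n_workers in workers_per_dc:
            dc_model = global_model.copy()
            for _ in range(tau_local):
                # Stand-in for one local training step on each worker.
                worker_updates = rng.normal(scale=0.01, size=(n_workers, dim))
                # Intra-datacenter aggregation over the high-bandwidth LAN.
                dc_model -= worker_updates.mean(axis=0)
            dc_models.append(dc_model)
        # One inter-datacenter aggregation over the low-speed WAN.
        global_model = np.mean(dc_models, axis=0)
    return global_model

model = train_hierarchical(workers_per_dc=[4, 8], tau_local=5, global_rounds=3)

Raising tau_local trades more local computation for fewer costly WAN synchronizations, which mirrors the decision the indicator mechanism makes in the proposed architecture.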
Second, to solve the straggler problem caused by system heterogeneity among datacenters, we propose a communication-efficient synchronization mechanism. The performance of different datacenters usually varies considerably, which poses the challenge of system heterogeneity to inter-datacenter federated learning. With the goal of minimizing the single-round iteration time, we propose a heterogeneity-aware semi-synchronous mechanism. Its main idea is to let datacenters with different performance run different numbers of intra-datacenter synchronization rounds before performing inter-datacenter synchronization. To determine a proper number of intra-datacenter synchronization rounds for each datacenter, we propose a communication frequency optimization algorithm. Furthermore, we observe that, as training proceeds and the model converges, selectively reducing the frequency of inter-datacenter model synchronization can improve training efficiency. Motivated by this observation, we propose a training-progress-aware communication frequency adjustment algorithm to further improve the communication efficiency of model synchronization among datacenters. Experiments show that these two techniques reduce inter-datacenter communication rounds by up to 66% and achieve a convergence time gain of up to 43%, without accuracy loss.

Lastly, we propose a bidirectional sparse compression technique for intra-datacenter parameter aggregation to mitigate the conflict between the huge communication overhead of parameter aggregation and the limited intra-datacenter bandwidth. Compared with the rapid growth in the number of training nodes and the size of machine learning models, datacenter bandwidth capacity grows very slowly, which makes the communication bottleneck of intra-datacenter parameter aggregation increasingly prominent. In this thesis, we adopt data compression for greater communication efficiency and propose a two-way sparse compression technique for large gradient tensors and a mixed-precision compression technique for small gradient tensors. Because different layers of a model hold different numbers of parameters, a layer-by-layer transmission mechanism is adopted to push and pull model gradients, and different compression strategies are applied to gradient tensors of different sizes. Experiments show that these two techniques achieve a convergence time gain of up to 92% and a compression ratio of up to 95%, without accuracy loss.
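As an illustration of the size-dependent compression strategy described above, the following is a minimal sketch under assumed settings; the size threshold, sparsity ratio, and layer names are illustrative, not the thesis's actual parameters. Large gradient tensors are top-k sparsified, small gradient tensors are cast to FP16, and each layer is handled independently, in the spirit of layer-by-layer transmission.

import numpy as np

def compress_gradient(grad, size_threshold=2048, sparsity=0.05):
    # Large tensors: keep only the top-k entries by magnitude (sparse compression).
    if grad.size >= size_threshold:
        k = max(1, int(grad.size * sparsity))
        flat = grad.ravel()
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        return ("sparse", grad.shape, idx.astype(np.int64), flat[idx].astype(np.float32))
    # Small tensors: mixed-precision compression by casting to FP16.
    return ("fp16", grad.shape, grad.astype(np.float16))

def decompress_gradient(payload):
    if payload[0] == "sparse":
        _, shape, idx, vals = payload
        flat = np.zeros(int(np.prod(shape)), dtype=np.float32)
        flat[idx] = vals
        return flat.reshape(shape)
    _, shape, half = payload
    return half.astype(np.float32).reshape(shape)

# Layer-by-layer push/pull: each layer's gradient is compressed independently,
# so a large convolutional tensor and a small bias vector take different paths.
grads = {"conv1.weight": np.random.randn(64, 64, 3, 3), "fc.bias": np.random.randn(10)}
compressed = {name: compress_gradient(g) for name, g in grads.items()}
restored = {name: decompress_gradient(p) for name, p in compressed.items()}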
Keywords/Search Tags: Inter-datacenter Federated Learning, Hierarchical Parameter Server Architecture, Communication Efficiency, Communication Frequency Adjustment, Gradient Compression