
Research On Key Technologies For Training Efficiency Optimization Of Distributed Machine Learning Over WAN

Posted on: 2024-12-19 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: H M Zhou | Full Text: PDF
GTID: 1528307373468974 | Subject: Communication and Information System
Abstract/Summary:
To leverage the aggregation and value-added effects of massive data distributed across geo-distributed edge servers or institutions, distributed machine learning over WAN (DML-WAN) has become a widely used privacy-preserving collaborative learning paradigm. It deploys training tasks directly on logical participating nodes over the WAN; a typical deployment performs collaborative training under the coordination of a central parameter server. Participating nodes exchange intermediate results rather than raw data during training, so model training is completed collaboratively without exposing data.

Compared with distributed machine learning within data centres, DML-WAN has new features in three dimensions: the communication domain, the computing domain, and the data domain. These features expose DML-WAN to more severe system-efficiency bottlenecks and model-accuracy problems. First, in the communication domain, WANs exhibit more complex network characteristics: limited bandwidth, high transmission latency, low reliability, and time-varying conditions. Distributed machine learning has a large parameter communication volume and a high communication frequency, so the high communication-to-computation ratio of DML-WAN slows the entire system and can even prevent the training task from completing within an acceptable time. Second, in the computing domain, there is strong computing heterogeneity among participating nodes across regions, and this heterogeneity is neither static nor predictable. Computing heterogeneity means that participating nodes have different local computation times in each training round, so parameter synchronization is blocked by stragglers. Although methods have been proposed to solve this synchronization-blocking problem, they introduce additional system costs and model-accuracy loss. Third, in the data domain, the training data of participating nodes are biased in quantity, quality, and preference, and typically follow non-independent and identically distributed (non-IID) distributions, a property also called data heterogeneity. Data heterogeneity causes the gradients or updates from different participating nodes to diverge, lowering the generalization of the global model and resulting in model bias and low model accuracy. Therefore, this dissertation improves the performance of DML-WAN from two aspects: parameter communication optimization and parameter synchronization optimization. The research content and main contributions are as follows.

(1) For the parameter communication bottleneck, this dissertation holds that the scientific problem to be addressed is how to adaptively schedule parameter transmission based on parameter characteristics and physical network conditions, so as to optimize the communication performance of parameter synchronization. According to the communication service level, the dissertation decomposes parameter communication optimization into end-to-end parameter communication optimization and parameter communication mode optimization, improving parameter communication performance over bandwidth-limited, low-reliability, dynamically heterogeneous wide-area networks. First, for end-to-end optimization, the dissertation proposes an importance-aware Differential Gradient Transmission protocol (DGT), which improves end-to-end gradient transmission performance. DGT classifies gradients by importance and serves them with differentiated transmission QoS (e.g., reliability and priority), so that important gradients are lost less often and are applied to the model sooner, improving end-to-end communication performance while ensuring that model accuracy is not damaged. Experiments show that, compared with fully reliable gradient transmission (the baseline), DGT reduces the completion time of classic training tasks by 19.4% (GoogLeNet), 34.4% (AlexNet), and 36.5% (VGG11), and its acceleration is significantly better than heuristic bounded-tolerance approximate gradient transmission. Second, for parameter communication mode optimization, the dissertation proposes a parameter Transmission Scheduling Engine (TSEngine), which improves the communication performance of model distribution and model aggregation. TSEngine divides parameter communication into model distribution and model aggregation. For model distribution, the dissertation proposes an auto-learning communication scheduling protocol that dynamically selects distribution nodes and schedules the transmission order by combining random and greedy selection, reducing the average time for workers to receive models. For model aggregation, it proposes a minimal-waiting-delay communication scheduling protocol that prioritizes reducing aggregation blocking caused by resource heterogeneity, lowering model aggregation delay. Experiments show that, compared with the traditional hub-and-spoke and static tree-based communication overlays (RACK and BINARY), TSEngine speeds up AlexNet training tasks by 1.95×~2.38×.

(2) For parameter synchronization blocking and the model-accuracy problem, this dissertation holds that the scientific problem to be addressed is how to adaptively schedule parameter synchronization based on real-time network perception, so as to optimize collaborative learning performance. First, for the high proportion of parameter synchronization blocking, the dissertation proposes a joint algorithm-system optimization for high parallelism between local computation and global synchronization (Non-blocking Synchronization, NBSync), which hides global synchronization overhead and improves system efficiency. NBSync relaxes the dependency of a model update on the previous round (strong dependency) to a dependency on the two preceding rounds (weak dependency). Based on this relaxation, NBSync schedules local computation and global synchronization in parallel at the system level, and proposes an adaptive local computation mechanism that senses the global synchronization time window and incrementally schedules local computation. NBSync lets local computation fully utilize computing power to explore higher-quality gradients, thereby efficiently hiding parameter synchronization. Experiments show that, under computing heterogeneity (H = 150:1) and limited bandwidth (WAN bandwidth B = 66 Mbps), NBSync speeds up training by 2.79× compared with ESync (the speed baseline), and loses no convergence accuracy compared with FedAvg (the accuracy baseline). Its effectiveness is further verified on other complex models, datasets, and data distributions. Second, for the model-accuracy problem of computing-heterogeneity-tolerant training with heterogeneous data, the dissertation proposes a joint optimization of group-based weak synchronization and differential gradient transmission (Federated Group-based Synchronization, FedGSync), which accelerates distributed machine learning under coexisting computing heterogeneity and data heterogeneity. FedGSync first groups workers by data distribution and designs a group-based weak synchronization algorithm to reduce synchronization blocking. Second, FedGSync designs a fast approximate grouping mechanism that accelerates gradient principal component analysis and the grouping algorithm using historical state, improving system efficiency. Finally, FedGSync designs group-based gradient transmission, which prioritizes gradient transmission from straggler groups based on historical feedback, further reducing the blocking of weak synchronization. Experiments show that, compared with FedAT (the speed baseline), FedGSync improves training performance by 1.28×~1.74×; compared with FedSSGD (the accuracy baseline), FedGSync has the least accuracy loss, below 0.01 in all experimental tasks. It can be concluded that FedGSync is robust to both computing heterogeneity and data heterogeneity.
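To illustrate the core idea behind DGT's importance-aware classification, the sketch below partitions a gradient into an "important" slice (to be scheduled on a reliable, high-priority channel) and a "residual" slice (to tolerate loss). The function name, the magnitude-based importance proxy, and the 10% split are illustrative assumptions, not the dissertation's actual protocol.

```python
import numpy as np

def partition_by_importance(grad, important_frac=0.1):
    """Split a gradient into an 'important' slice (sent reliably) and a
    'residual' slice (sent best-effort). Magnitude is used here as a
    stand-in importance measure (an assumption for illustration)."""
    flat = grad.ravel()
    k = max(1, int(important_frac * flat.size))
    # indices of the k largest-magnitude components
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    important = (idx, flat[idx])            # reliable, high-priority channel
    mask = np.ones(flat.size, dtype=bool)
    mask[idx] = False
    residual = (np.nonzero(mask)[0], flat[mask])  # loss-tolerant channel
    return important, residual
```

In a real transport layer, the two slices would map to different QoS classes (e.g., acknowledged vs. unacknowledged delivery), matching DGT's differentiated reliability and priority.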
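TSEngine's model-distribution protocol combines random and greedy selection when choosing which node forwards the model next. A minimal epsilon-greedy sketch of that selection rule is shown below; the function name, the bandwidth-estimate dictionary, and the exploration rate are hypothetical details added for illustration.

```python
import random

def pick_next_sender(candidates, est_bandwidth, eps=0.2):
    """Epsilon-greedy choice of the next distribution node: with
    probability eps explore a random candidate, otherwise exploit the
    candidate with the highest estimated outgoing bandwidth."""
    if random.random() < eps:
        return random.choice(candidates)        # random selection (explore)
    # greedy selection (exploit observed link quality)
    return max(candidates, key=lambda c: est_bandwidth.get(c, 0.0))
```

Repeating this choice as transfers complete yields a dynamically learned distribution overlay, rather than a static hub-and-spoke or fixed tree.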
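NBSync's weak dependency means round t+1's local computation may start before round t's aggregate has arrived, absorbing only the aggregate of round t-1. The single-threaded sketch below mimics that one-round-stale pipeline; the scalar model, the 0.1 step size, and the callback names are illustrative assumptions (a real system would overlap the two stages with actual concurrency).

```python
def nbsync_train(num_rounds, local_grad, aggregate):
    """Weak-dependency update loop: while round t's gradients are being
    synchronized, round t+1 computes locally on a model that has only
    absorbed the aggregate of round t-1 (one round stale)."""
    w = 0.0
    in_flight = None                  # previous round's aggregate, "in transit"
    for t in range(num_rounds):
        g = local_grad(w, t)          # local computation for round t
        if in_flight is not None:
            w -= 0.1 * in_flight      # apply the round t-1 aggregate
        in_flight = aggregate(g, t)   # start synchronizing round t
    return w
```

Because the update at round t never waits on round t's own synchronization, global synchronization cost can be hidden behind local computation, which is the effect NBSync exploits.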
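FedGSync groups workers by data distribution using gradient principal component analysis. The sketch below projects per-worker gradients onto their top principal components and clusters them with a few k-means iterations; the function name, the farthest-point initialization, and the use of plain k-means are assumptions made for illustration, not the dissertation's fast approximate grouping mechanism.

```python
import numpy as np

def group_workers(grads, n_groups=2, n_components=2, n_iter=10):
    """Group workers by a gradient-PCA proxy for data distribution:
    embed each worker's gradient in the top principal subspace, then
    run a few Lloyd (k-means) iterations there."""
    G = np.asarray(grads, dtype=float)        # shape: (workers, params)
    Gc = G - G.mean(axis=0)
    _, _, Vt = np.linalg.svd(Gc, full_matrices=False)
    Z = Gc @ Vt[:n_components].T              # low-dimensional embedding
    # deterministic farthest-point initialization of group centers
    centers = [Z[0]]
    for _ in range(n_groups - 1):
        d = np.min([((Z - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(Z[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(n_iter):                   # a few Lloyd iterations
        labels = np.argmin(((Z[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for k in range(n_groups):
            if (labels == k).any():
                centers[k] = Z[labels == k].mean(axis=0)
    return labels
```

Workers whose gradients point in similar directions (i.e., with similar data distributions) land in the same group, which the weak-synchronization algorithm can then synchronize together to reduce blocking.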
Keywords/Search Tags:Distributed Machine Learning over WAN, Multi-party Collaborative Learning, Communication Optimization for AI System, Parameter Synchronization and Parallel Optimization