Distributed machine learning has emerged as a pivotal technology in modern AI training. Although advanced data centers have been built worldwide and computing power has grown significantly, the capabilities and data resources of any single center remain insufficient for massive user demand. First, the available computing resources are fragmented across centers, so users face long delays waiting for devices to become free for new training tasks. Second, the training data in these centers exhibit regional biases and, owing to privacy restrictions, cannot be pooled in a single center, which prevents models from learning more generalizable features. The industry is therefore seeking to integrate computing and data resources from multiple centers, aiming to improve resource utilization, serve more distributed training tasks promptly, and build AI models with better generalization.

Geo-distributed machine learning (Geo-ML) has emerged as a solution: data centers train model replicas on local data and synchronize by exchanging learned parameters or gradients, without moving raw data. This combines fragmented computing power and diverse data from multiple centers, enabling the training of models with stronger generalization. However, Geo-ML systems operate over low-bandwidth, high-latency wide-area networks (WANs) and must cope with resource heterogeneity and dynamism. First, the limited WAN bandwidth causes high parameter-transmission latency, which becomes a major efficiency bottleneck. Second, WAN resources are heterogeneous and vary over time, frequently causing synchronization blocking. Third, data heterogeneity across centers slows convergence and degrades model performance. Together, these factors prolong training times and restrict the practical application of Geo-ML systems. This dissertation tackles these challenges from three aspects:
communication architecture, computation and communication scheduling, and data-heterogeneity optimization. The primary objective is to shorten training time by lowering transmission delays, reducing blocking delays, and decreasing the number of training rounds. Accordingly, this dissertation develops a Geo-ML system named GeoMX and introduces three novel optimizations that complement one another and yield significant improvements in training efficiency. The main contributions are summarized as follows:

(1) We develop GeoMX, a high-performance Geo-ML system for distributed training across data centers. It features HiPS, a hierarchical parameter server architecture that reduces cross-center communication and enables flexible parameter synchronization. GeoMX further improves communication efficiency with five advanced techniques. Running across centers, GeoMX delivers 4x faster training than MXNet in a single center, and it is the first open-source Geo-ML system.

(2) We propose ESync, a computing-power-aware, adaptive, and blocking-free synchronization mechanism. By varying the number of inner-sync iterations, ESync balances computing times across centers and minimizes blocking delays without compromising convergence. It uses a state server to sense computing power and dynamically determines the action each center should take. As a result, it outperforms synchronous (e.g., FSA and TiFL) and asynchronous (e.g., DCAHA and FedAsync) algorithms in training speed, data throughput, network load, and computational balance.

(3) We propose NetStorm, an adaptive communication scheduler for heterogeneous and dynamic networks. It minimizes transmission and blocking delays through a multi-root fastest-aggregation-tree topology, supported by a network-awareness module. NetStorm further employs a multi-path auxiliary transmission mechanism for faster parallel transmission, and a consistency mechanism ensures seamless topology updates. NetStorm achieves a training speedup of
6.5 to 9.2 times over GeoMX and outperforms MLNET and TSEngine in both static and dynamic networks.

(4) We propose FedGS, an algorithm that tackles convergence problems caused by data heterogeneity by selecting devices within each center to form super nodes with consistent data distributions. We propose GBP-CS to enable fast and precise device selection, and a composed synchronization protocol reduces data skew within centers while keeping communication costs low. Compared with FedAvg, FedGS reduces training rounds by 70%, improves model accuracy by 3.9%, and outperforms ten advanced algorithms in accuracy.
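To make the ESync idea from contribution (2) concrete, the following is a minimal sketch, in our own simplified Python rather than the dissertation's implementation, of how a computing-power-aware schedule could assign inner-sync iterations inversely to each center's measured per-iteration time, so that fast and slow centers finish a global round at roughly the same wall-clock time. The function and variable names here are illustrative assumptions, not GeoMX APIs.

```python
# Sketch of a computing-power-aware iteration schedule in the spirit of
# ESync: fast centers perform more local (inner-sync) iterations, slow
# centers fewer, so all centers reach the global synchronization point
# at roughly the same time and no center blocks the others.
# All names are illustrative, not from the GeoMX codebase.

def inner_iterations(per_iter_times, round_budget):
    """per_iter_times: measured seconds per local iteration, one per center.
    round_budget: target wall-clock seconds for one global round.
    Returns the number of local iterations assigned to each center."""
    return [max(1, int(round_budget // t)) for t in per_iter_times]

# Example: three centers with heterogeneous speeds and a 10-second round.
times = [0.5, 1.0, 2.5]              # fast, medium, and slow center
iters = inner_iterations(times, 10.0)
print(iters)                          # -> [20, 10, 4]
```

Each center then finishes its assigned iterations in about `round_budget` seconds, which is the balancing effect the abstract attributes to ESync; the real mechanism additionally tracks center state dynamically via a state server.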