
Towards Architecture Design And Performance Evaluation For Distributed Machine Learning Systems In Datacenter Networks

Posted on: 2022-08-27    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Q H Zhou    Full Text: PDF
GTID: 1488306557962879    Subject: Access to information and control
Abstract/Summary:
Datacenter networks serve as the fundamental infrastructure for big data and cloud computing platforms, and are widely used to deploy data-intensive applications and large-scale parallel jobs. Given the huge demands on computational capacity and network bandwidth, these jobs often rely on distributed processing. Over the past decade, the rapid development of artificial intelligence has made machine learning applications the dominant workload inside datacenter networks. Conventional distributed processing systems, such as Hadoop and Spark, are not specifically optimized for the properties of distributed machine learning applications, and therefore perform poorly on metrics such as iteration speed, model quality, inference latency, and system overhead. It is thus worth designing dedicated distributed frameworks for machine learning applications and evaluating their performance in realistic deployments.

The architecture design of distributed machine learning systems depends closely on the development of commodity hardware and software. Although existing distributed systems are built on the powerful computational capacity and large-scale data of datacenter networks, their performance is still bounded by many practical challenges, including topology mismatch, limited bandwidth, heterogeneous environments, stragglers, and huge resource demands. Previous work often focuses on improving a single metric and cannot provide comprehensive optimization from a system perspective. Based on the above analysis, this dissertation aims to break the performance bottlenecks in handling distributed machine learning tasks and to design a high-performance system for heterogeneous datacenter networks, providing uniform interactions for large-scale machine learning applications. To conduct a software-hardware co-design, this dissertation decouples the whole
system into four core modules, each designed to meet one of the following four objectives.

(1) To make the system compatible with the underlying network topology, this dissertation analyzes multi-interface communication patterns and topology characteristics, aiming to improve gradient exchange efficiency and model convergence rate. Specifically, it designs a cluster topology driver module based on gradient exchange algorithms and collective communication interfaces, whose goal is to make the distributed processing framework adaptive to different commodity topologies. By employing decentralized gradient exchange and piece-level collective communication, this module fully exploits the parallel transmission capacity of multi-interface networks and improves the model synchronization efficiency of machine learning applications. Realistic deployment in a decentralized distributed training environment demonstrates that the proposed topology driver saves 56.28% of communication overhead and brings a 2.84x speedup in model convergence.

(2) To address the interaction deficiency caused by limited bandwidth, this dissertation focuses on traffic scheduling, bandwidth assignment, and dynamic compression, aiming to minimize flow completion time and communication overhead. Specifically, it designs a compression-aware coflow scheduler module for datacenter networks. The core of this module is a joint strategy for flow scheduling and data compression built on the proposed Fastest-Volume-Disposal-First heuristic, which saves communication overhead and improves the efficiency of data-intensive applications. Realistic deployment on the HiBench workloads demonstrates that the proposed coflow scheduler brings a 1.47x acceleration and a 48.41% reduction in average coflow completion time and communication traffic, respectively.

(3) In order to match the heterogeneous
environment and solve the straggler problem in cluster coordination, this dissertation investigates heterogeneity-aware parameter synchronization and anti-straggler mechanisms, aiming to improve computational resource utilization and task processing speed. Specifically, it designs a heterogeneity-aware parameter server module based on straggler projection analysis and computational parallelism control. This module captures the application-level patterns of distributed machine learning tasks and mitigates the inter-node computational gap by controlling training parallelism. Meanwhile, it employs container-level task migration to balance workloads and fully utilize cluster resources. Together, these methods mitigate straggler problems in heterogeneous environments. Realistic deployment on a GPU/CPU heterogeneous cluster demonstrates that the proposed parameter server module reduces straggler delay time by 86.16% and brings a 3.19x acceleration in training iteration speed.

(4) To reduce the huge resource cost of model training, this dissertation elaborates workload placement for green machine learning applications in resource-constrained environments, aiming to minimize job completion time and energy cost. Specifically, it designs a distributed training manager based on model partitioning and computing graph assignment, which jointly optimizes processing speed and energy cost for machine learning tasks. By using a hierarchical hybrid synchronization mechanism, this module effectively coordinates the cluster workload and enables green machine learning in resource-constrained environments. Realistic deployment on image classification tasks in an edge environment demonstrates that the proposed distributed training manager brings a 4.67x acceleration in task processing speed and saves 68.92% of energy cost.

The above four modules together constitute the distributed machine learning
system for datacenter networks. This dissertation pursues software-hardware synergy across the underlying network topology, the middle interaction layer, and upper-layer application demands, delivering effective improvements in distributed training efficiency, task processing speed, and energy saving. It can serve as theoretical guidance for future research and promote the development of high-performance computing, addressing the crucial issues of fault tolerance, cluster scalability, resource utilization, and inference latency. In addition, the open-source modules in the proposed system offer convenience to both researchers and developers, benefiting various demands in real-world industry scenarios.
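The decentralized, piece-level gradient exchange described for module (1) is in the spirit of a ring all-reduce, where each worker's gradient is cut into pieces that circulate around the ring in parallel so that every link stays busy in every step. The single-process sketch below illustrates that general idea only; the function name, plain-list gradients, and piece layout are illustrative assumptions, not the dissertation's actual implementation.

```python
# Minimal ring all-reduce sketch: n workers, each gradient split into n pieces.
# Pieces circulate so every link carries traffic in every step, which is how
# piece-level exchange exploits the parallel transmission capacity of the ring.

def ring_allreduce(grads):
    """grads: one gradient list per worker (equal lengths). Returns the
    element-wise sum, as every worker would hold it after the exchange."""
    n = len(grads)
    size = len(grads[0])
    bounds = [(i * size // n, (i + 1) * size // n) for i in range(n)]
    pieces = [[g[a:b] for (a, b) in bounds] for g in grads]

    # Phase 1: reduce-scatter. In step s, worker w sends piece (w - s) mod n
    # to its right neighbour, which accumulates it into its own copy.
    for s in range(n - 1):
        incoming = [pieces[w][(w - s) % n] for w in range(n)]
        for w in range(n):
            p = (w - 1 - s) % n          # piece index arriving at worker w
            recv = incoming[(w - 1) % n]
            pieces[w][p] = [a + b for a, b in zip(pieces[w][p], recv)]

    # Phase 2: all-gather. The fully reduced piece p now lives on worker
    # (p + 1) mod n; circulate the reduced pieces until everyone has them all.
    for s in range(n - 1):
        incoming = [pieces[w][(w + 1 - s) % n] for w in range(n)]
        for w in range(n):
            p = (w - s) % n
            pieces[w][p] = incoming[(w - 1) % n]

    # All workers now agree; flatten worker 0's pieces.
    return [x for piece in pieces[0] for x in piece]
```

Because both phases take n-1 steps and each worker sends only 1/n of the gradient per step, per-worker traffic stays near 2x the gradient size regardless of cluster size, which is why ring-style exchange scales well for model synchronization.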
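The abstract does not spell out the Fastest-Volume-Disposal-First heuristic of module (2), but the name suggests ordering coflows by how quickly their data volume can be disposed of, applying compression only when it shortens that time. The sketch below is a hedged reading under those assumptions; the cost model, parameters, and function names are invented for illustration and are not the dissertation's algorithm.

```python
# Hedged Fastest-Volume-Disposal-First-style sketch: each coflow is sent raw
# or compressed, whichever disposes of its volume faster, and coflows are then
# served in increasing order of disposal time (shortest effective job first),
# which minimizes average completion time on a single bottleneck link.

def disposal_time(volume_mb, bw_mbps, comp_ratio, comp_speed_mbps):
    """Return (best disposal time in seconds, whether compression wins)."""
    raw = volume_mb / bw_mbps
    # Compression path: pay the CPU cost, then ship fewer bytes.
    compressed = volume_mb / comp_speed_mbps + (volume_mb * comp_ratio) / bw_mbps
    return min(raw, compressed), compressed < raw

def fvdf_schedule(coflows, bw_mbps=100.0):
    """coflows: list of (name, volume_mb, comp_ratio, comp_speed_mbps).
    Returns (service order with compression decisions, avg completion time)."""
    scored = []
    for name, vol, ratio, speed in coflows:
        t, compress = disposal_time(vol, bw_mbps, ratio, speed)
        scored.append((t, name, compress))
    scored.sort()                        # fastest volume disposal first
    clock, completions, order = 0.0, [], []
    for t, name, compress in scored:
        clock += t
        completions.append(clock)
        order.append((name, compress))
    return order, sum(completions) / len(completions)
```

For example, a 1000 MB coflow that compresses 2:1 at 400 MB/s beats its own raw transfer over a 100 MB/s link, while a small, poorly compressible coflow goes out raw and first; the ordering itself is plain shortest-job-first on the effective times.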
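One plausible reading of the parallelism control in module (3) is to size each worker's batch in proportion to its measured throughput, so that per-iteration compute times equalize and fast workers no longer idle at the synchronization barrier waiting for stragglers. The sketch below illustrates that reading only; the proportional policy and helper names are assumptions, not the dissertation's method.

```python
# Hedged sketch of heterogeneity-aware parallelism control: split the global
# batch across workers in proportion to measured throughput (samples/sec),
# so each worker finishes its share in roughly the same wall-clock time.

def balance_batches(throughputs, global_batch):
    """Return per-worker batch sizes proportional to throughput, summing
    exactly to global_batch (remainder goes to largest fractional parts)."""
    total = sum(throughputs)
    raw = [global_batch * t / total for t in throughputs]
    shares = [int(x) for x in raw]
    remainder = global_batch - sum(shares)
    by_fraction = sorted(range(len(raw)), key=lambda i: raw[i] - shares[i],
                         reverse=True)
    for i in by_fraction[:remainder]:
        shares[i] += 1
    return shares

def iteration_time(throughputs, shares):
    # A synchronous barrier waits for the slowest worker.
    return max(b / t for b, t in zip(shares, throughputs))
```

With workers running at 100, 300, and 600 samples/s and a global batch of 1000, proportional shares of 100, 300, and 600 give every worker a 1.0 s iteration, whereas a uniform split would leave the barrier waiting over 3 s for the slowest node.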
Keywords/Search Tags: Datacenter Networks, Distributed Processing, Machine Learning Systems, Parameter Server, Heterogeneous Environment