
Research On Network Traffic Scheduling Mechanism Of Distributed Machine Learning

Posted on: 2022-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: Y L He
Full Text: PDF
GTID: 2518306524984419
Subject: Master of Engineering
Abstract/Summary:
Benefiting from big data, large models, and GPU clusters, artificial intelligence technology has developed rapidly, but training an accurate and expressive model on this foundation remains difficult, which has driven the rapid development of parallel and distributed machine learning. In current distributed machine learning frameworks, the parameter computation and parameter communication of computing nodes are tightly coupled in series, resulting in low utilization of computing resources. At the same time, with the rapid development of dedicated high-speed computing hardware in recent years, the growth of computing power has far outpaced network data transmission capability. High-speed computing devices make parameter exchange in distributed machine learning more frequent, and parameter communication has become the performance bottleneck of distributed machine learning systems. How to reduce the ratio of communication time to computation time, balance computation and communication, and improve the utilization of computing resources is therefore key to improving the performance of distributed machine learning systems. Starting from the algorithm level and the network level, this thesis studies the network traffic scheduling mechanism of distributed machine learning, optimizes the communication efficiency of the distributed learning system, and improves overall system performance.

1. At the algorithm level, the thesis studies a traffic transmission mechanism for distributed machine learning based on multiple priorities and multiple paths, optimizing the back-propagation algorithm for a distributed training environment. First, the thesis proposes a no-wait back-propagation algorithm that overlaps the back-propagation computation with parameter communication and assigns a different communication priority to the parameters of each layer, yielding a priority-based no-wait back-propagation algorithm. This decouples a computing node's computation from its parameter communication, reduces the ratio of communication time to computation time, and improves the utilization of computing resources. Then, exploiting the existence of multiple disjoint physical communication links between computing nodes, the thesis designs a multi-path parallel parameter synchronization scheme to further reduce the share of parameter communication time and improve the performance of the distributed machine learning system. Simulations show that the designed algorithm reduces the communication-to-computation time ratio of the distributed system and performs well.

2. At the network level, the thesis studies the mechanism of traffic congestion control in distributed machine learning clusters, in particular the "TCP Incast" problem in distributed machine learning frameworks based on the parameter server architecture. The thesis designs a network congestion control mechanism, SCC, that dynamically adjusts each source's sending rate to avoid or alleviate packet-loss timeouts and retransmissions. This reduces parameter communication delay, lowers the share of communication time in the distributed machine learning system, and improves system performance. Experimental simulation results show that, compared with the baseline schemes, the SCC mechanism effectively improves network throughput and reduces flow completion time.
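The priority idea behind the no-wait back-propagation scheme can be illustrated with a minimal sketch. Back-propagation produces gradients from the last layer down to the first, but the next forward pass consumes parameters starting from the first layer, so sending early-layer gradients first hides more communication. The priority values and function name below are hypothetical simplifications, not the thesis's actual implementation:

```python
import heapq

def schedule_gradient_pushes(num_layers):
    """Sketch of priority-based gradient transmission scheduling.

    Backprop visits layers in reverse order, but the next forward pass
    needs layer 0's parameters first.  Assigning each layer's gradient a
    priority equal to its layer index (lower = more urgent) lets
    early-layer updates overtake late-layer ones on the wire.
    """
    queue = []
    # Backward pass: gradients become available from the last layer down.
    for layer in reversed(range(num_layers)):
        heapq.heappush(queue, (layer, f"grad[{layer}]"))
    # Drain in priority order to obtain the transmission schedule.
    return [heapq.heappop(queue)[1] for _ in range(len(queue))]

print(schedule_gradient_pushes(4))  # earliest layers are transmitted first
```

In a real system the queue would be drained concurrently with the ongoing backward computation, which is what overlaps communication with computation.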
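The multi-path parallel parameter synchronization scheme can likewise be sketched: if several disjoint physical links connect two nodes, the parameter blob can be sharded so each link carries one shard in parallel. The partitioning function below is an illustrative assumption, not the thesis's actual scheme:

```python
def split_across_paths(params, num_paths):
    """Sketch: partition a flat parameter list into roughly equal shards,
    one per disjoint physical path, so transfers can proceed in parallel.
    """
    base, extra = divmod(len(params), num_paths)
    shards, start = [], 0
    for p in range(num_paths):
        # The first `extra` shards take one additional element each.
        size = base + (1 if p < extra else 0)
        shards.append(params[start:start + size])
        start += size
    return shards

print(split_across_paths(list(range(10)), 3))
```

With k equal-bandwidth paths this ideally cuts synchronization time by a factor of k, which is why it further reduces the share of communication time.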
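A mechanism like SCC that dynamically adjusts the source sending rate can be sketched with a classic additive-increase/multiplicative-decrease (AIMD) update, a standard way to back off when loss signals congestion. All parameter names and values here are assumptions for illustration; SCC's actual control rule is defined in the thesis:

```python
def adjust_rate(rate, loss_detected,
                increase=1.0, decrease=0.5,
                min_rate=1.0, max_rate=100.0):
    """AIMD-style sender rate update (illustrative, not SCC itself).

    On congestion (loss detected) the rate is cut multiplicatively to
    drain the bottleneck queue quickly; otherwise it grows additively
    to probe for spare bandwidth.  Rates are clamped to [min, max].
    """
    if loss_detected:
        rate *= decrease
    else:
        rate += increase
    return max(min_rate, min(max_rate, rate))

print(adjust_rate(10.0, loss_detected=False))  # probes upward
print(adjust_rate(10.0, loss_detected=True))   # backs off sharply
```

In an incast scenario, many senders backing off together in this way prevents the synchronized bursts that cause packet-loss timeouts at the parameter server's switch port.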
Keywords/Search Tags:Distributed Machine Learning, Backpropagation Algorithm, Multipath Parameter Synchronization, Congestion Control