
Performance Optimization Of Distributed Machine Learning Cluster System

Posted on: 2021-01-06  Degree: Master  Type: Thesis
Country: China  Candidate: X S He  Full Text: PDF
GTID: 2428330623968238  Subject: Engineering
Abstract/Summary:
Massive data and large-scale machine learning models have provided solid foundations for the development of artificial intelligence, and they have also driven the rapid development of distributed machine learning in recent years. For current distributed machine learning frameworks, low utilization of computing resources is a common problem: because the computation and communication processes are tightly coupled, a computing node's resources sit idle while it communicates. Removing the interdependence between computation and communication during training, increasing their overlap ratio, and running them in parallel can therefore effectively improve the utilization of cluster computing resources.

At the same time, with the rapid development of computing devices in recent years, the parameters of distributed machine learning models are updated more and more frequently, and the communication caused by these frequent parameter updates is becoming the bottleneck of distributed training. This bottleneck arises not only from frequent communication itself, but also because most current training frameworks configure their logical topology statically, so the logical topology used for training cannot adapt dynamically to the physical topology of the underlying network; severe communication bottlenecks can occur when the underlying network changes significantly. Giving the training framework the ability to adapt to the cluster network, that is, to perceive the underlying network and dynamically match the logical topology to the physical topology, can alleviate the framework's communication bottleneck and optimize cluster communication performance.

Against this background, this thesis proposes a distributed machine learning performance optimization architecture, NPS. The architecture is based on the MXNet framework and optimizes the distributed machine learning cluster in terms of both training efficiency and communication performance. First, NPS parallelizes training by decoupling the computation and communication processes of the computing nodes, maximizing the overlap ratio of computation and communication so that computation during training is no longer blocked by communication; this makes full use of computing resources and improves the performance of the upper-layer training workload. Second, the network-adaptive function of NPS is realized through dynamic perception of the cluster network and logical topology switching, which enables the framework to handle or mitigate, without interrupting training, the communication bottlenecks caused by static configuration or by changes in the cluster network; this realizes the dynamic adaptation of the logical topology to the physical topology and thereby improves the communication performance of the cluster. In addition, NPS encapsulates upper-layer interfaces that provide good usability to users. Note that the NPS architecture implemented in this thesis focuses mainly on realizing network adaptation through logical topology switching and does not conduct special research on how the logical topology itself is formulated.
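To make the decoupling idea concrete, the following is a minimal, hypothetical Python sketch rather than the NPS implementation: computed gradients are handed to a background "communication" thread so that the main loop can immediately start computing the next mini-batch. The simulated delays and all names here are illustrative assumptions.

```python
import queue
import threading
import time

# Minimal sketch (illustrative only, not the NPS code): gradients produced
# by the "computation" loop are handed to a background thread that performs
# the "communication", so the next mini-batch can be computed while the
# previous gradients are still being exchanged.

grad_queue = queue.Queue()

def communicator():
    """Consume gradients and 'synchronize' them, overlapping with computation."""
    while True:
        step, grad = grad_queue.get()
        if grad is None:               # sentinel: stop the thread
            break
        time.sleep(0.05)               # stand-in for a network push/pull
        print(f"step {step}: gradient of {len(grad)} values synchronized")

comm_thread = threading.Thread(target=communicator)
comm_thread.start()

for step in range(3):
    grad = [0.0] * 1024                # stand-in for a computed gradient tensor
    grad_queue.put((step, grad))       # enqueue and keep computing
    time.sleep(0.02)                   # stand-in for the next forward/backward pass

grad_queue.put((None, None))           # shut the communicator down
comm_thread.join()
```

In MXNet itself, a similar overlap is obtained through its dependency engine and asynchronous KVStore push/pull operations rather than explicit application-level threads.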
In the end, the NPS architecture was evaluated through experiments. The experimental results show that NPS can effectively improve the cluster resource utilization and training efficiency of the distributed machine learning framework, and that its network adaptability improves the communication performance of the framework.
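The network-adaptive behaviour described above can be illustrated with the following hypothetical sketch: the cluster network is probed periodically, and the logical topology is switched when the measured conditions change. The functions measure_bandwidth() and choose_topology(), the worker names, and the threshold are assumptions made for illustration and are not the measurement or decision logic used by NPS.

```python
import random
import time

# Hypothetical sketch of network-adaptive topology switching: probe the
# cluster network at each monitoring interval and switch the logical
# topology when conditions change. The probe and the heuristic below are
# illustrative stand-ins.

WORKERS = ["worker0", "worker1", "worker2", "worker3"]

def measure_bandwidth():
    """Stand-in probe: return an estimated bandwidth (MB/s) per worker link."""
    return {w: random.uniform(50, 1000) for w in WORKERS}

def choose_topology(bandwidth, threshold=200.0):
    """Prefer a ring when all links are fast; fall back to a star
    (parameter-server style) when some link is slow, so that the slow
    link is traversed less often."""
    if min(bandwidth.values()) >= threshold:
        return "ring"
    return "parameter_server"

current = None
for _ in range(3):                      # one probe per monitoring interval
    bw = measure_bandwidth()
    proposed = choose_topology(bw)
    if proposed != current:
        print(f"switching logical topology: {current} -> {proposed}")
        current = proposed              # a real system would re-wire communication here
    time.sleep(0.1)
```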
Keywords/Search Tags:Distributed machine learning, MXNet, Parallelization, Network adaptive