
Research On Scheduling Optimization Of Distributed Machine Learning System

Posted on: 2021-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhou
GTID: 2428330605481145
Subject: Computer Science and Technology
Abstract/Summary:
As machine learning training sets grow larger and models become more complex, single-machine training can no longer cope with large-scale data. In recent years, distributed machine learning has therefore attracted increasing attention for its ability to process massive data and to scale flexibly. Most distributed machine learning systems are built on the parameter server architecture. In a distributed system, node crashes and network outages occur at random, so a parameter server system that relies on static scheduling exhibits poor scalability and robustness. Heterogeneity among nodes further harms the portability and adaptability of the parameter server system, and because nodes are shared by multiple users and tasks, node performance differs and synchronization time is prolonged.

Synchronization is necessary to keep parallel training effective. The bulk synchronous parallel strategy keeps training accuracy close to that of single-machine training, but it incurs a large communication overhead and is easily affected by performance differences between nodes. The asynchronous parallel strategy greatly reduces synchronization time but cannot guarantee convergence or model accuracy. The delay (stale) synchronization strategy strikes a balance between synchronization time and model accuracy, yet it is not suited to environments with large performance differences and degenerates into the bulk synchronous parallel strategy in the extreme case; a minimal illustration of this staleness bound is given after item (1) below. Scheduling optimization has therefore become the key to ensuring high portability, high reliability, high adaptability and low synchronization cost in a distributed machine learning system.

To address these problems, this paper takes scheduling optimization as its research direction and tackles the poor portability, the inability to cope with dynamic changes of training resources, and the poor adaptability of the cluster. Two scheduling optimization strategies are proposed: a dynamic-scheduling strategy and an adaptive-scheduling strategy. Based on these two strategies, this paper implements ParaisoML, a distributed machine learning system. The main contents of this paper are as follows:

(1) This paper analyzes the defects of static scheduling and proposes a dynamic-scheduling strategy. Static scheduling is unaware of the dynamic changes of node resources during parallel training, ports poorly between clusters, and makes scalability difficult to improve. The dynamic-scheduling strategy suits scenarios in which resources change dynamically: it adjusts the resources allocated to training as system resources change, improving portability and scalability, and it effectively alleviates performance differences between nodes and reduces synchronization time. The strategy is designed, implemented and analyzed theoretically; the results show that convergence is preserved and the loss of model accuracy is acceptable.
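To make item (1) above concrete, the following is a minimal sketch, in Python, of one way the dynamic-scheduling idea could be realized: whenever the sampled free capacity of the nodes changes, the training workload is re-split in proportion to that capacity. The function name allocate_batches and the proportional-allocation rule are assumptions made for illustration; they are not taken from the thesis.

# Illustrative only: re-split a fixed training workload across nodes in
# proportion to the free capacity currently observed on each node.
def allocate_batches(total_batches: int, free_capacity: dict) -> dict:
    total = sum(free_capacity.values())
    shares = {node: int(total_batches * cap / total)
              for node, cap in free_capacity.items()}
    # Give any rounding remainder to the node with the most free capacity.
    remainder = total_batches - sum(shares.values())
    fastest = max(free_capacity, key=free_capacity.get)
    shares[fastest] += remainder
    return shares

if __name__ == "__main__":
    # Re-run the allocation whenever resource sampling reports a change.
    print(allocate_batches(1000, {"node-1": 0.9, "node-2": 0.5, "node-3": 0.2}))
    print(allocate_batches(1000, {"node-1": 0.3, "node-2": 0.5, "node-3": 0.2}))

Under such a rule, a node that loses capacity is simply handed a smaller share of the next round of training, which is the behaviour the dynamic-scheduling strategy relies on to absorb resource changes.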
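As a further illustration of the staleness bound mentioned in the synchronization discussion above, the sketch below shows the standard bounded-staleness condition that a delay (stale) synchronization scheme checks before a worker starts its next iteration: a bound of zero degenerates to bulk synchronous parallel, while an unbounded value behaves like fully asynchronous parallel. The names are hypothetical and the code does not reproduce the thesis's implementation.

def may_proceed(worker_clock: int, all_clocks: list, staleness: int) -> bool:
    # A worker may advance only while it is at most `staleness` iterations
    # ahead of the slowest worker in the cluster.
    return worker_clock - min(all_clocks) <= staleness

if __name__ == "__main__":
    clocks = [12, 10, 15, 11]          # current iteration of each worker
    print(may_proceed(15, clocks, 3))  # False: 5 iterations ahead of the slowest
    print(may_proceed(12, clocks, 3))  # True: within the staleness bound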
(2) This paper extends and optimizes the dynamic-scheduling strategy and proposes an adaptive-scheduling strategy. The dynamic-scheduling strategy can adjust nodes and resource allocation according to resource changes, but it cannot correct the inherent performance differences between nodes or cope with unpredictable, temporary resource changes. The adaptive-scheduling strategy supports the dynamic joining and exiting of nodes and reduces the performance gap between nodes through data partitioning. In addition, by analyzing how model accuracy changes during training, the adaptive-scheduling strategy alleviates the bottleneck that the number of iterations needed for convergence differs across models and is hard to predict in advance. The design and implementation of the adaptive-scheduling strategy are discussed, and the results show that the strategy further reduces synchronization time.

(3) This paper designs and implements ParaisoML, a distributed machine learning system based on the dynamic-scheduling and adaptive-scheduling strategies. The system consists of a communication system, a resource detection system and a task scheduling system. The communication system provides reliable data communication services on top of a network file system. The resource detection system uses sigar, an open-source toolkit, to sample the resource utilization of nodes and supplies this information to the task scheduling system. The task scheduling system analyzes the resource samples and allocates training nodes and resources. On the basis of random sampling and random segmentation, the data partition strategy achieves task load balancing through random and dynamic increments, as sketched below.

(4) This paper tests and analyzes the performance of common synchronization strategies in distributed machine learning systems, and then tests and analyzes the portability, scalability and adaptability of ParaisoML. The experimental results show that ParaisoML reduces synchronization time and offers good portability, scalability and adaptability while maintaining high accuracy and a high convergence rate.
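The data partition flow referred to in item (3) can be pictured with the following simplified sketch: node resource utilization is sampled, and the shuffled training data is then cut into contiguous segments whose sizes follow each node's free capacity. The sampling function here is a stand-in (the thesis uses the sigar toolkit for real measurements), and the partitioning rule is an assumption for illustration rather than ParaisoML's actual strategy.

import random

def sample_free_capacity(nodes: list) -> dict:
    # Stand-in for the resource detection system; real measurements would
    # come from a toolkit such as sigar rather than random numbers.
    return {node: random.uniform(0.2, 1.0) for node in nodes}

def partition(sample_ids: list, capacity: dict) -> dict:
    # Shuffle the sample ids, then cut contiguous segments proportional to
    # each node's free capacity; any remainder goes to the last node.
    random.shuffle(sample_ids)
    total = sum(capacity.values())
    parts, start = {}, 0
    for node, cap in capacity.items():
        size = round(len(sample_ids) * cap / total)
        parts[node] = sample_ids[start:start + size]
        start += size
    parts[node].extend(sample_ids[start:])
    return parts

if __name__ == "__main__":
    nodes = ["node-1", "node-2", "node-3"]
    parts = partition(list(range(10000)), sample_free_capacity(nodes))
    print({n: len(p) for n, p in parts.items()})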
Keywords/Search Tags: Distributed Machine Learning, Parameter Server System, Scheduling Optimization, Dynamic Scheduling, Adaptive Scheduling