
Research On Dynamic Scheduling Method For Distributed Machine Learning Tasks

Posted on: 2024-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: W B Zhu
GTID: 2568307103475084
Subject: Computer technology
Abstract/Summary:
With the ever-growing volume of data in modern society, machine learning algorithms have become a key technology behind many application services. However, training on a single computing node faces challenges such as insufficient memory and long training times. In distributed machine learning, the communication and synchronization required between computing nodes introduce parameter staleness, which has become an important factor affecting model performance and reliability.

Most existing solutions to the parameter-staleness problem focus on optimizing communication and synchronization protocols, for example by adopting more efficient communication protocols or refining synchronization strategies. Although these methods alleviate the impact of stale model parameters, they still have drawbacks: they may depend excessively on hyperparameters, causing the algorithm to fall into local optima, or they may fail to adapt to the uncertain performance fluctuations of cluster nodes in production environments, so they cannot fully eliminate the model staleness caused by asynchronous communication.

This thesis therefore proposes a dynamic scheduling method for distributed machine learning tasks, consisting of a weight-based load balancing strategy (WLBS) and a weight-based adaptive load balancing strategy (Auto-WLBS), to reduce the high training cost caused by parameter staleness and synchronization in distributed machine learning models. The core contributions are as follows:

1) To address the static task allocation of the LSP model and the inability of existing parallel computing strategies to fully resolve the large errors and high training-time costs caused by delayed model parameters, a task allocation and optimization strategy, WLBS, is proposed. The thesis first explains the limitations of static allocation and existing parallel communication strategies for distributed model training in heterogeneous cluster environments, and then presents the policy design and algorithm implementation of WLBS. Combining the LSP model with WLBS yields the W-LSP model (a minimal sketch of this proportional allocation follows the abstract).

2) Because WLBS cannot adapt well to real-time changes in node performance, it is further optimized so that the actual task allocation comes closer to the ideal allocation. Auto-WLBS is therefore proposed: it predicts node performance with a recurrent neural network and allocates tasks in proportion to the predictions (see the second sketch below). Combining Auto-WLBS with the LSP model yields the AW-LSP model.

In addition, building on several parallel models and the idea of the parameter server architecture, an experimental environment is set up for the distributed machine learning system Kunal ML, and distributed training is carried out on it with the W-LSP and AW-LSP models to verify the correctness and effectiveness of WLBS and Auto-WLBS. Experiments show that, in the 4-process scenario, the AW-LSP and W-LSP models shorten training time by 5.40% and 13.56% respectively compared with the LSP model, and improve model accuracy by 1.13 and 3.11 percentage points respectively.
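To make the WLBS idea concrete, below is a minimal sketch of weight-based proportional task allocation. It assumes each worker's throughput has already been measured; the function name, weighting scheme, and rounding rule are illustrative assumptions, not the thesis' actual algorithm.

```python
def wlbs_allocate(total_samples, throughputs):
    """Split total_samples across workers in proportion to throughput."""
    total = sum(throughputs)
    shares = [int(total_samples * t / total) for t in throughputs]
    # Hand any rounding remainder to the fastest worker so shares sum up.
    fastest = max(range(len(throughputs)), key=lambda i: throughputs[i])
    shares[fastest] += total_samples - sum(shares)
    return shares

# Example: a heterogeneous 4-worker cluster (samples/second estimates).
print(wlbs_allocate(1024, [120.0, 95.0, 60.0, 45.0]))
# -> [384, 304, 192, 144]
```

Because faster workers receive proportionally larger shares, all workers finish an iteration at roughly the same time, which is what limits the staleness of the parameters they push to the server.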
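Auto-WLBS replaces the static measurement with a prediction step. The sketch below, assuming a PyTorch-style recurrent network (`nn.RNN`), shows the shape of that idea: feed each worker's recent throughput history to the RNN, predict next-step performance, and reuse the proportional split from above. The model size, input features, and the absence of a training loop are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PerfPredictor(nn.Module):
    """Tiny RNN mapping a throughput history to a next-step estimate."""
    def __init__(self, hidden=16):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, history):              # history: (workers, steps, 1)
        out, _ = self.rnn(history)
        return self.head(out[:, -1, :])      # use the last hidden state

predictor = PerfPredictor()
history = torch.rand(4, 8, 1)                # 4 workers, 8 past measurements
with torch.no_grad():
    predicted = predictor(history).squeeze(-1).clamp(min=1e-6)

# Allocate the next 1024 samples in proportion to predicted performance,
# mirroring the WLBS split above.
shares = 1024 * predicted / predicted.sum()
print(shares.round().tolist())
```

In a real deployment the predictor would be trained online on observed throughputs, so the allocation tracks node performance as it drifts rather than reacting one iteration late.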
Keywords/Search Tags: Distributed Machine Learning, Parameter Server Architecture, Task Allocation and Optimization, Dynamic Scheduling