
Research On Dynamic Scheduling Method For Distributed Machine Learning Tasks

Posted on: 2024-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: W B Zhu
GTID: 2568307103475084
Subject: Computer technology
Abstract/Summary:
With the ever-growing volume of data in modern society, machine learning algorithms have become a key technology behind many application services. However, training on a single computing node faces challenges such as insufficient memory and long training times. In distributed machine learning, the communication and synchronization required between computing nodes introduce parameter staleness, which has become an important factor affecting model performance and reliability.

Most existing solutions to the parameter-staleness problem focus on optimizing communication and synchronization protocols, for example by adopting more efficient communication protocols or refining synchronization strategies. Although these methods alleviate the impact of stale model parameters, they still have drawbacks: they may depend excessively on hyperparameters, causing the algorithm to fall into local optima, or they may fail to adapt to the uncertain performance fluctuations of cluster nodes in production environments, so they cannot fully eliminate the model staleness caused by asynchronous communication.

This thesis therefore proposes a dynamic scheduling method for distributed machine learning tasks, consisting of a weight-based load balancing strategy (WLBS) and a weight-based adaptive load balancing strategy (Auto-WLBS), to reduce the high training cost caused by parameter staleness and synchronization in distributed machine learning models. The core contributions are as follows:

1) To address the static task allocation of the LSP model and the inability of existing parallel computing strategies to fully resolve the large errors and high training-time costs caused by delayed model parameters, a task allocation and optimization strategy, WLBS, is proposed. The thesis first explains the limitations of static allocation and existing parallel communication strategies for distributed model training in heterogeneous cluster environments, and then presents the policy design and algorithm implementation of WLBS. Combining the LSP model with WLBS yields the W-LSP model (a minimal sketch of this proportional allocation follows the abstract).

2) Because WLBS cannot adapt well to real-time changes in node performance, it is further optimized so that the actual task allocation comes closer to the ideal allocation. Auto-WLBS is therefore proposed: it predicts node performance with a recurrent neural network and allocates tasks in proportion to the predictions (see the second sketch below). Combining Auto-WLBS with the LSP model yields the AW-LSP model.

In addition, building on several parallel models and the idea of the parameter server architecture, an experimental environment is set up for the distributed machine learning system Kunal ML, and distributed training is carried out on it with the W-LSP and AW-LSP models to verify the correctness and effectiveness of WLBS and Auto-WLBS. Experiments show that, in the 4-process scenario, the AW-LSP and W-LSP models shorten training time by 5.40% and 13.56% respectively compared with the LSP model, and improve model accuracy by 1.13 and 3.11 percentage points respectively.
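To make the WLBS idea concrete, below is a minimal sketch of weight-based proportional task allocation. It assumes each worker's throughput has already been measured; the function name, weighting scheme, and rounding rule are illustrative assumptions, not the thesis' actual algorithm.

```python
def wlbs_allocate(total_samples, throughputs):
    """Split total_samples across workers in proportion to throughput."""
    total = sum(throughputs)
    shares = [int(total_samples * t / total) for t in throughputs]
    # Hand any rounding remainder to the fastest worker so shares sum up.
    fastest = max(range(len(throughputs)), key=lambda i: throughputs[i])
    shares[fastest] += total_samples - sum(shares)
    return shares

# Example: a heterogeneous 4-worker cluster (samples/second estimates).
print(wlbs_allocate(1024, [120.0, 95.0, 60.0, 45.0]))
# -> [384, 304, 192, 144]
```

Because faster workers receive proportionally larger shares, all workers finish an iteration at roughly the same time, which is what limits the staleness of the parameters they push to the server.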
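Auto-WLBS replaces the static measurement with a prediction step. The sketch below, assuming a PyTorch-style recurrent network (`nn.RNN`), shows the shape of that idea: feed each worker's recent throughput history to the RNN, predict next-step performance, and reuse the proportional split from above. The model size, input features, and the absence of a training loop are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PerfPredictor(nn.Module):
    """Tiny RNN mapping a throughput history to a next-step estimate."""
    def __init__(self, hidden=16):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, history):              # history: (workers, steps, 1)
        out, _ = self.rnn(history)
        return self.head(out[:, -1, :])      # use the last hidden state

predictor = PerfPredictor()
history = torch.rand(4, 8, 1)                # 4 workers, 8 past measurements
with torch.no_grad():
    predicted = predictor(history).squeeze(-1).clamp(min=1e-6)

# Allocate the next 1024 samples in proportion to predicted performance,
# mirroring the WLBS split above.
shares = 1024 * predicted / predicted.sum()
print(shares.round().tolist())
```

In a real deployment the predictor would be trained online on observed throughputs, so the allocation tracks node performance as it drifts rather than reacting one iteration late.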
Keywords/Search Tags: Distributed Machine Learning, Parameter Server Architecture, Task Allocation and Optimization, Dynamic Scheduling