| With the continued development and expansion of deep learning, fields such as speech recognition and recommendation systems have been widely studied and applied. At the same time, with the rapid development of technology in modern society, the scale of collected data and information has grown explosively. To better meet the actual needs of users, the deep learning models in use are becoming increasingly complex. Although the performance of these models keeps improving, a new problem has arisen: the time and computing resources required to train them grow with the amount of data. Distributed learning systems are also widely used in the Internet field. To improve the task throughput of a distributed machine learning system, its resources must be fully utilized to meet the changing needs of distributed learning tasks. This paper studies three aspects of distributed systems: resource management, data copy management, and computing resource management. For resource management, this paper uses the ARIMA model and the GRU model jointly to forecast traffic and estimate the subsequent volume of distributed learning tasks, and then uses the Exponentially Weighted Moving-Average (EWMA) model to predict the computing resources required by that future task volume. The experimental results show that this approach relieves the computing pressure on each computing node, thereby increasing the throughput of the distributed system. In the multi-copy management scheme, this paper evaluates how hot or cold each piece of data is according to its access volume. Data with a large access volume is hot data, and data whose access volume exceeds the threshold must be copied, increasing the number of copies to achieve load balancing. For cold data with little access, the number of copies of data below the lower limit must be reduced so that the data server has 
more space to store hot data. In addition, this paper uses a heuristic algorithm to select the data server, which provides a guarantee for the storage of data copies. The method responds reasonably to the amount of data access, and the data server selected by the heuristic algorithm meets both the user's access requirements and the storage requirements of the data copies. In terms of resource scheduling, this paper adopts a reinforcement learning method to manage limited computing resources so that they are reasonably allocated to different parameter servers. Experiments show that the reinforcement learning algorithm used in this paper can optimize the resource scheduling of distributed learning clusters to a certain extent. |
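
As a rough illustration of two of the mechanisms summarized above, the following Python sketch shows an EWMA one-step forecast and a threshold rule for adjusting the number of data copies. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `ewma_forecast` and `adjust_copies`, the smoothing factor `alpha`, the thresholds, and the `min_copies`/`max_copies` bounds are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def ewma_forecast(history: Sequence[float], alpha: float = 0.5) -> float:
    """Return the EWMA of `history`, used here as a one-step-ahead forecast.

    `alpha` is an assumed smoothing factor; the abstract does not state one.
    """
    if not history:
        raise ValueError("history must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    estimate = history[0]
    for value in history[1:]:
        # Recent observations get weight alpha, the running estimate 1 - alpha.
        estimate = alpha * value + (1.0 - alpha) * estimate
    return estimate


def adjust_copies(
    access_count: int,
    copies: int,
    hot_threshold: int,
    cold_threshold: int,
    min_copies: int = 1,   # assumed lower bound, not from the paper
    max_copies: int = 5,   # assumed upper bound, not from the paper
) -> int:
    """Return the new copy count for one data item based on its access volume."""
    if access_count > hot_threshold:
        # Hot data: add a copy to spread the load across data servers.
        return min(copies + 1, max_copies)
    if access_count < cold_threshold:
        # Cold data: drop a copy to free space for hot data.
        return max(copies - 1, min_copies)
    return copies
```

For example, `ewma_forecast([4.0, 8.0, 12.0], alpha=0.5)` returns `9.0`, and `adjust_copies(120, 2, hot_threshold=100, cold_threshold=10)` returns `3`. Changing the copy count by one per evaluation is a design choice made here for simplicity; the abstract does not specify the step size, and it also leaves the choice of target data server to a heuristic that this sketch does not cover.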