In recent years, deep learning has flourished across many fields of artificial intelligence. As research has progressed, training models have grown ever larger, and the arrival of the big-data era has caused the volume of training data to explode, so large models and large datasets have gradually become the mainstream of deep learning training. Traditional single-machine training often cannot meet the demands of large models and large datasets because of limited computing and storage capacity, which has given rise to distributed deep learning, in which multiple machines train collaboratively. By coordinating multiple training devices to complete a training task, distributed deep learning greatly expands the storage and computing capacity of the system, accelerates the training process, and shortens the time required for deep learning training. However, multi-machine training also brings new problems and challenges: the frequent data transfers and parameter synchronization required during collaborative training limit further improvements in distributed deep learning performance.

Traditional distributed training strategies mainly target homogeneous clusters in which all nodes have the same performance. In practice, hardware is updated rapidly, and the maintenance and upgrading of equipment in a cluster often leave training nodes of different models whose computational, communication, and other capabilities differ. In such heterogeneous clusters, traditional strategies find it difficult to achieve, within a given training time, the same training results as in homogeneous environments, mainly because the overall training speed and effectiveness are constrained by slow nodes. In addition, the inconsistent capabilities of the nodes lead to poor resource utilization for the cluster as a whole. Reducing or eliminating the influence of slow nodes, accelerating model training, and improving the utilization of cluster computing and communication resources in complex heterogeneous environments is therefore an urgent problem.

Heterogeneous environments are complex and diverse, manifesting mainly as differences in the computational and communication capabilities of nodes. Existing studies offer solutions for distributed deep learning training in specific heterogeneous environments, but most of them consider differences in only one kind of capability (computational or communication) and lack a study of the more general case in which nodes differ in both.

In this thesis, we propose a new distributed deep learning training scheme that takes into account differences in both the computational and communication capabilities of nodes in heterogeneous environments, eliminates unnecessary communication and waiting caused by slow nodes, shortens training time, and makes full use of cluster computing and communication resources. Specifically, we propose ASHL, an adaptive multi-stage distributed deep learning training scheme for heterogeneous environments, which consists of three parts. The first part is an intelligent training task allocation mechanism that quantitatively evaluates the capability of each training node through a short pre-training run, so that the training and communication tasks of each node can be planned reasonably, laying the foundation for formal training. The second part is a mixed-mode training model, divided into an AHL phase and an SHL phase, whose goal is to balance model accuracy and training speed and to improve the convergence of the model. The third part is a dynamic compression-based communication strategy that further improves the communication efficiency of the training process. The experimental results in this thesis show that, compared with more advanced existing schemes such as ADSP, ASHL reduces the overall training time by more than 30% in reaching the same degree of convergence and has better generalization ability.
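To make the capability-aware task allocation idea concrete, the following is a minimal sketch, assuming each node's capability is summarized by the per-sample training throughput measured during the short pre-training run; the function name `split_global_batch`, the example throughput values, and the proportional-split rule are illustrative assumptions rather than the exact allocation used by ASHL.

```python
def split_global_batch(throughputs, global_batch):
    """Divide a global mini-batch among nodes in proportion to their measured
    per-sample training throughput (samples/second from the pre-training run)."""
    total = sum(throughputs)
    shares = [max(1, round(global_batch * t / total)) for t in throughputs]
    shares[-1] += global_batch - sum(shares)  # absorb rounding drift on the last node
    return shares

# Example: four heterogeneous nodes profiled during pre-training (samples/sec, assumed values).
throughputs = [480.0, 350.0, 210.0, 120.0]
print(split_global_batch(throughputs, global_batch=1024))
# -> [424, 309, 185, 106]: faster nodes receive proportionally larger local batches,
#    so all nodes finish a local step in roughly the same wall-clock time.
```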
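The compression-based dynamic communication strategy is likewise only outlined in this section; the sketch below shows one common form of gradient compression, top-k sparsification in PyTorch, where the `ratio` parameter could in principle be adjusted per node or per round according to link speed. It is an assumed illustration of the general technique, not ASHL's actual compression scheme.

```python
import torch

def topk_compress(grad: torch.Tensor, ratio: float):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries;
    return their indices and values (the payload actually sent over the network)."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]

def topk_decompress(indices, values, shape, numel):
    """Rebuild a dense gradient from the sparse payload received from a worker."""
    flat = torch.zeros(numel)
    flat[indices] = values
    return flat.reshape(shape)

# Example: a node on a slower link could use a smaller ratio (stronger compression).
g = torch.randn(1000)
idx, vals = topk_compress(g, ratio=0.05)                 # send ~5% of the entries
g_hat = topk_decompress(idx, vals, g.shape, g.numel())   # dense gradient used for the update
```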