With the expansion of datasets and neural network models, the computing and storage capacity of a single device can no longer meet growing training needs. To improve the performance of large-scale neural network training, distributed training across multiple devices has become a research focus in recent years. Data parallelism is currently the mainstream distributed training method: it replicates the model to multiple device nodes and partitions the dataset into sub-datasets that are trained in parallel. However, data parallelism only addresses training on massive data; it cannot help when the model itself is too large for the computing and storage resources of a single device. Model parallelism has therefore become inevitable for large deep learning models: the model is partitioned into multiple submodels that are scheduled to different devices for parallel execution.

At present there is no mature solution for model parallelism. Partitioning relies mainly on expert experience and requires developers to simultaneously master deep learning, distributed computing, computer architecture, and other fields, placing high demands on practitioners' expertise. To simplify the design of parallel strategies, academia and industry have proposed automatic model parallelism methods, mainly based on machine learning and on graph algorithms, but these methods still suffer from the following problems. (1) Low strategy execution performance. Current methods do not consider the search direction comprehensively: they are guided by a single dimension such as training time or memory cost and lack a fine-grained, comprehensive performance evaluation model. For models with complex structure and large scale, they easily fall into local optima during the search, so the resulting strategies execute poorly. (2) Low strategy search efficiency. Current methods must first collect operator baselines (computation time, communication time, etc.) in the real environment, and the methods themselves depend on a high-performance distributed environment. For complex, large-scale models, their search time typically ranges from several hours to several days, which can even exceed the time required to train the model itself.

To address these problems, this paper studies heuristic-algorithm-based automatic model parallelism from two directions, iterative search and non-iterative search, combining quantitative modeling of distributed training performance, model structural dependences, and operator communication characteristics. The research content of this paper is summarized as follows.

(1) To address the low execution performance of searched strategies, this paper proposes TGA, an automatic model parallelism method based on a two-population genetic algorithm, from the perspective of iterative search. First, a multi-dimensional cost evaluation model fusing computation, communication, and load is established to assess the strengths and weaknesses of candidate strategies at a fine granularity. Then, guided by this cost model, the two-population genetic algorithm iteratively searches a simplified solution space for an optimal or near-optimal solution.
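To make this search procedure concrete, the following is a minimal, illustrative Python sketch of a two-population genetic search guided by a cost model fusing computation, communication, and load. The names fused_cost and tga_search, the weights alpha, beta, and gamma, the one-point crossover, the mutation rate, and the elite-migration scheme are all hypothetical assumptions for illustration; they are not the exact operators or parameters used by TGA.

```python
import random

def fused_cost(placement, compute_cost, comm_cost, num_devices,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Fuse computation (bottleneck device load), communication
    (cross-device traffic), and load imbalance into one score."""
    loads = [0.0] * num_devices
    for op, dev in enumerate(placement):
        loads[dev] += compute_cost[op]
    comm = 0.0
    for (src, dst), volume in comm_cost.items():
        if placement[src] != placement[dst]:  # cross-device edge costs traffic
            comm += volume
    imbalance = max(loads) - min(loads)
    return alpha * max(loads) + beta * comm + gamma * imbalance

def tga_search(num_ops, num_devices, compute_cost, comm_cost,
               pop_size=40, generations=200, migrate_every=20):
    """Two populations evolve independently and periodically exchange elites."""
    def random_placement():
        return [random.randrange(num_devices) for _ in range(num_ops)]

    def evolve(pop):
        scored = sorted(pop, key=lambda p: fused_cost(
            p, compute_cost, comm_cost, num_devices))
        elites = scored[: pop_size // 4]          # keep the best quarter
        children = []
        while len(elites) + len(children) < pop_size:
            a, b = random.sample(elites, 2)
            cut = random.randrange(1, num_ops)    # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:             # mutation: move one operator
                child[random.randrange(num_ops)] = random.randrange(num_devices)
            children.append(child)
        return elites + children

    pop_a = [random_placement() for _ in range(pop_size)]
    pop_b = [random_placement() for _ in range(pop_size)]
    for gen in range(generations):
        pop_a, pop_b = evolve(pop_a), evolve(pop_b)
        if gen % migrate_every == 0:              # exchange best individuals
            pop_a[-1], pop_b[-1] = pop_b[0], pop_a[0]
    return min(pop_a + pop_b, key=lambda p: fused_cost(
        p, compute_cost, comm_cost, num_devices))
```

In this sketch each candidate strategy is a flat device assignment, one device index per operator. The two populations evolve independently and periodically exchange their best individuals, which is the mechanism that helps such a search escape the local optima discussed above.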
Experimental results show that, compared with the Baechi approach, TGA improves strategy execution performance by up to 42%; compared with the Hierarchical framework, it improves execution performance by up to 37.7% and effectively avoids model memory overflow, addressing the low execution performance of current methods when searching strategies for complex models.

(2) To address the low efficiency of strategy search, this paper proposes SGP (Swift Graph Partition), an automatic model parallelism method based on load balancing, from the perspective of non-iterative search. First, the communication cost, computational cost, and comprehensive cost of each operator, as well as the training performance cost of the whole model, are modeled from operator characteristics alone, without collecting data in the real environment. Then, an automatic partitioning and scheduling method based on load balancing is proposed to quickly generate model parallel strategies. Experimental results show that SGP can find parallel training strategies with excellent execution performance within a few seconds: its strategy search is up to 110.72 times faster than TGA and up to 33.51 times faster than the Baechi framework, addressing the low search efficiency of current methods on complex models.
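As a companion illustration, the sketch below shows a non-iterative, load-balancing partition in the spirit of SGP. It assumes operator costs are estimated analytically from operator characteristics (for example, FLOPs derived from tensor shapes) rather than profiled in a real environment; the function name sgp_partition and the greedy topological-order assignment rule are illustrative assumptions, not the thesis's actual algorithm.

```python
from collections import defaultdict

def sgp_partition(ops, edges, compute_cost, comm_cost, num_devices):
    """Assign operators (given in topological order) to devices, balancing
    estimated load while penalizing cross-device communication."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    placement = {}
    loads = [0.0] * num_devices
    for op in ops:                       # ops must be topologically sorted
        best_dev, best_score = 0, float("inf")
        for dev in range(num_devices):
            # communication is incurred only for predecessors on other devices
            comm = sum(comm_cost[(p, op)]
                       for p in preds[op] if placement[p] != dev)
            score = loads[dev] + compute_cost[op] + comm
            if score < best_score:
                best_dev, best_score = dev, score
        placement[op] = best_dev
        loads[best_dev] += compute_cost[op]
    return placement
```

Because each operator is visited exactly once and scored against each device, the sketch runs in time roughly proportional to the number of devices times the number of edges, with no profiling and no iterative refinement, which is consistent with a search that completes in seconds rather than the hours required by iterative methods.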