Deep learning technology has been widely applied in image processing, speech, natural language processing, and other fields. With the continuing growth of neural networks and datasets, improving the operating efficiency of models has become a major challenge. In recent years, multi-core Artificial Intelligence chips based on a pipeline-parallel mode have emerged as an effective solution to this problem, and how to deploy neural network models onto the physical cores of such chips has become a valuable research question. The key to this problem is how the computational graph representation of the neural network is placed across the cores. The main subject of this thesis is how to generate a multi-core placement for the subgraphs obtained by dividing the computational graph, so as to reduce the cost of inter-core communication.

In this thesis, the computational graph placement problem is first described abstractly and modeled as a Markov decision process. Placement algorithms based on the deep reinforcement learning methods REINFORCE, DQN, and PPO are then designed to optimize the placement decision process, and these three algorithms achieve high-quality placements on small-scale chips. Given the limitations of these three single-agent algorithms on large-scale chip placement, improved placement algorithms based on the asynchronous Ape-X and APPO architectures are designed, and distributed training is explored to further shorten training time. Finally, based on these algorithmic results, a prototype system for placing computational graphs on physical cores is developed, which provides placement model training, placement scheme generation, and display of placement results.

The main contributions and innovations of this thesis are as follows:
(1) An environment description method for placing computational graphs on physical cores is defined, and an action constraint strategy named CORES-MASK is designed to dynamically update the set of selectable actions, enforcing the constraint that the same core cannot be selected more than once in the placement environment.
(2) Multi-agent parallel sampling under the asynchronous architecture effectively increases the randomness and diversity of the training data and raises the reward ceiling reached during training, overcoming the difficulty of learning effectively under single-agent sampling.
(3) A reward-based two-end priority sampling algorithm is proposed. Taking the reward of each trajectory as its priority, the stored data are sorted, and training batches are drawn from both ends of the ranking, so as to maximize the probability of selecting high-reward actions and reduce the probability of selecting low-reward actions. The experimental results show that this sampling method alleviates the slow growth of rewards in the early stage of APPO training and also improves the final convergence.
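To make the Markov decision process formulation concrete, the following is a minimal sketch of a placement environment in Python: the state is the partial assignment of subgraphs to cores, an action selects a physical core for the next subgraph, and the reward penalizes the inter-core communication induced by that choice. The class name, observation encoding, and cost model are illustrative assumptions, not the environment description actually defined in the thesis.

```python
import numpy as np

class PlacementEnv:
    """Sketch of computational-graph placement as a Markov decision process.

    comm_cost[i, j] is assumed to be the data volume exchanged between
    subgraph i and subgraph j; the real thesis environment may use a
    different observation and cost definition.
    """

    def __init__(self, comm_cost: np.ndarray, num_cores: int):
        self.comm_cost = comm_cost
        self.num_subgraphs = comm_cost.shape[0]
        self.num_cores = num_cores
        self.reset()

    def reset(self):
        self.assignment = np.full(self.num_subgraphs, -1)  # -1 = not yet placed
        self.step_idx = 0                                   # next subgraph to place
        return self._observe()

    def _observe(self):
        # Observation: current partial assignment plus index of the next subgraph.
        return np.append(self.assignment.copy(), self.step_idx)

    def step(self, core_id: int):
        self.assignment[self.step_idx] = core_id
        # Reward: negative communication added by this placement, i.e. traffic
        # to already-placed subgraphs that ended up on a different core.
        placed = np.where(self.assignment >= 0)[0]
        cost = sum(
            self.comm_cost[self.step_idx, j]
            for j in placed
            if j != self.step_idx and self.assignment[j] != core_id
        )
        self.step_idx += 1
        done = self.step_idx == self.num_subgraphs
        return self._observe(), -float(cost), done, {}
```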
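The CORES-MASK strategy described in contribution (1) can be illustrated by the sketch below, which keeps a Boolean mask over the cores and removes a core from the policy's action distribution once it has been assigned. The class and method names are hypothetical; only the idea of dynamically shrinking the set of selectable actions comes from the thesis.

```python
import numpy as np

class CoresMask:
    """Illustrative action-constraint mask in the spirit of CORES-MASK:
    a core that has already been chosen can no longer be selected."""

    def __init__(self, num_cores: int):
        self.available = np.ones(num_cores, dtype=bool)  # True = still selectable

    def apply(self, logits: np.ndarray) -> np.ndarray:
        """Set logits of used cores to -inf so their softmax probability is zero."""
        masked = logits.copy()
        masked[~self.available] = -np.inf
        return masked

    def commit(self, core_id: int) -> None:
        """Mark a core as used after a subgraph has been placed on it."""
        self.available[core_id] = False

    def reset(self) -> None:
        """Make every core selectable again at the start of an episode."""
        self.available[:] = True
```

In use, the mask would be applied to the policy logits before sampling each action and updated with `commit` after the chosen core is assigned, so repeated selection of the same core never occurs during rollout.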
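The reward-based two-end priority sampling of contribution (3) can be sketched as follows: trajectories are ranked by their episode reward and training batches are drawn from both ends of the ranking, so high-reward behaviour is reinforced more often while low-reward behaviour is still seen and can be discouraged. The function name, the split ratio, and the (reward, trajectory) tuple format are assumptions for illustration, not the exact implementation in the thesis.

```python
import random

def two_end_priority_sample(trajectories, batch_size, high_frac=0.7):
    """Draw a training batch from both ends of a reward-sorted trajectory buffer.

    trajectories: list of (episode_reward, trajectory_data) tuples.
    high_frac: assumed fraction of the batch taken from the high-reward end.
    """
    ranked = sorted(trajectories, key=lambda t: t[0])   # ascending by reward
    n_high = max(1, int(batch_size * high_frac))        # most samples: high-reward end
    n_low = batch_size - n_high                         # remainder: low-reward end
    batch = ranked[-n_high:] + ranked[:n_low]
    random.shuffle(batch)                               # avoid ordering bias in updates
    return batch
```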