In recent years, deep neural networks (DNNs), as a part of the field of artificial intelligence, have received widespread attention from all sectors of society, and more and more researchers have joined the theoretical and applied research on DNNs. To improve the accuracy of DNN models, the scale of sample parameters and model parameters keeps growing, which presents new challenges for multi-core processor architectures. A multi-core processor is still typically connected to a single off-chip memory. As the number of processor cores increases, a serious imbalance arises between the performance of the processor cores and that of the memory; this is the "memory wall" problem. During DNN training, a large number of parameters in memory must be read and written frequently, so memory access is very intensive. The memory wall severely restricts these memory accesses and increases the memory access delay. Therefore, to reduce the memory access delay during DNN training, this paper relieves the storage pressure on a single memory by distributing the DNN parameters across multiple off-chip memories and by reordering memory read and write requests. The main work and innovations of the paper are as follows:

(1) The storage characteristics of DNN parameter training are analyzed. Through a detailed decomposition of the DNN training process, the memory access characteristics of DNN training at a given moment are obtained. This paper defines the probability of conflict when accessing memory as the number of memory operations issued by the processor cores per unit time divided by the number of memories. From this definition, the ratio of conflict probability between the single-memory architecture and the multi-memory architecture is derived. Combining this result with memory read and write behavior yields the theoretical delay curves of the single-memory and multi-memory architectures.

(2) For the multi-memory architecture, we propose two distributed parameter storage methods based on a load-balancing distribution strategy. One is a fully connected distribution, which allows all processor cores to access data in all memories in turn. The other is a group-connected distribution, which divides the processor cores into groups, where each group can access only the part of the data held in its specified memory.

(3) The on-chip network is modeled and simulated with OPNET. Based on the hardware resource architecture designed for the neural network model, we carry out network simulation experiments, assuming that memory read latency and write latency are equal. The experiments show that a multi-core processor architecture for DNN training that is connected to multiple memories greatly reduces the total memory access delay during training. Comparing the fully connected and group-connected storage schemes for DNN parameters, we conclude that the group-connected scheme performs better.

(4) Building on the simulation experiments for the single-memory and multi-memory architectures, read and write performance is further optimized according to the read and write characteristics of DDR DRAM. Taking into account the actual number of DDR clock cycles needed to respond to memory access requests, the paper treats the time interval between memory access requests as the memory's response delay to read and write requests. A read-write request queue exchange algorithm is designed to optimize the order of memory access requests and thereby reduce the total access latency.
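
To make the conflict measure in (1) and the two distribution schemes in (2) more concrete, the following Python sketch may help. It is only an illustration under stated assumptions: the function names, the round-robin visiting order for the fully connected scheme, and the equal-sized groups for the group-connected scheme are hypothetical choices and are not taken from the paper.

```python
def conflict_probability(ops_per_unit_time: float, num_memories: int) -> float:
    """Conflict measure as worded in the abstract: memory operations issued
    by the processor cores per unit time, divided by the number of memories."""
    if num_memories <= 0:
        raise ValueError("num_memories must be positive")
    return ops_per_unit_time / num_memories


def conflict_ratio(ops_per_unit_time: float, num_memories: int) -> float:
    """Ratio of the single-memory measure to the multi-memory measure,
    assuming the same request rate in both architectures."""
    single = conflict_probability(ops_per_unit_time, 1)
    multi = conflict_probability(ops_per_unit_time, num_memories)
    return single / multi


def fully_connected_target(core: int, step: int, num_memories: int) -> int:
    """Fully connected scheme (assumed round-robin order): at each step a
    core moves on to the next memory, so it visits every memory in turn."""
    return (core + step) % num_memories


def group_connected_target(core: int, num_cores: int, num_memories: int) -> int:
    """Group-connected scheme (assumed equal-sized groups): cores are split
    into one group per memory and each group only uses its own memory."""
    if num_cores % num_memories != 0:
        raise ValueError("this sketch assumes equal-sized groups")
    group_size = num_cores // num_memories
    return core // group_size


if __name__ == "__main__":
    cores, memories, ops = 8, 4, 8.0
    print("single-memory measure:", conflict_probability(ops, 1))
    print("multi-memory measure:", conflict_probability(ops, memories))
    print("ratio:", conflict_ratio(ops, memories))
    print("fully connected, core 0:",
          [fully_connected_target(0, s, memories) for s in range(memories)])
    print("group-connected:",
          [group_connected_target(c, cores, memories) for c in range(cores)])
```

Under the equal-request-rate assumption, the ratio returned by `conflict_ratio` is simply the number of memories, which follows directly from the definition in (1). The sketch does not model latency, so it says nothing about which scheme performs better in the simulations.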