
Design And Implementation Of ResNet Convolutional Network On Multi-core Vector Processor

Posted on: 2021-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: K Cao
Full Text: PDF
GTID: 2518306050468594
Subject: Master of Engineering

Abstract/Summary:
Convolutional Neural Networks (CNNs) have shone in computer vision and natural language processing in recent years and are a representative family of deep-learning algorithms. The ResNet convolutional network model was the first CNN to surpass human accuracy at image recognition, and the first to reach a depth of 152 layers. However, CNN algorithms contain a very large number of multiply-add operations, so both training and inference are slow on traditional CPUs, and researchers have proposed schemes to accelerate CNNs on various hardware platforms. X-DSP is a multi-core vector processor for high-performance computing developed independently by the Institute of Microelectronics, National University of Defense Technology. It adopts an 11-issue Very Long Instruction Word (VLIW) architecture, and its Multiply-Accumulate (MAC) units make it well suited to accelerating convolutional neural networks. It is therefore worthwhile to design an implementation of the ResNet model tailored to this specific architecture.

This thesis analyzes in depth the internal computational characteristics of the convolution, pooling, and fully-connected modules together with the architecture of the multi-core vector processor. Three-dimensional image and convolution-kernel data are unrolled into two-dimensional matrices, and the matrix rows are computed in vectorized form. The specific schemes are:

(1) Vectorized convolution: the image data is split by channel, and each channel is unrolled into a two-dimensional matrix; all channels of one convolution window are transferred into the SRAM as a sub-block at a time. The kernel data is split by kernel, and each kernel is unrolled row-wise into a two-dimensional matrix, then divided into sub-blocks and transferred into the AM. Each step reads one scalar image value from the SRAM, broadcasts it into a vector of identical values, reads a vector of kernel values from the AM, and performs a multiply-accumulate; thus one image value is computed against multiple convolution kernels simultaneously.

(2) Vectorized pooling: the image data is split by channel, each channel is unrolled into a two-dimensional matrix, divided into sub-blocks, and transferred into the AM. Each step reads two image vectors from the AM, compares them, keeps the larger values, and compares the result with the next image vector; thus the comparisons for multiple channels are performed in one operation.

(3) Vectorized fully-connected computation: the kernel computation of the fully-connected layer is the same as that of convolution, so it reuses the convolution vectorization scheme.

Blocking the image and kernel data is the main implementation difficulty. In the convolution layers, the data of all channels of one convolution window is transferred into the SRAM at a time. The kernel blocking scheme is determined by the architectural characteristics: a single X-DSP core has 16 VPEs and each VPE has 3 MAC units, so 48 multiply-add operations can be performed simultaneously; the image blocking scheme is determined by the on-chip memory capacity. For data transmission, a "ping-pong" two-level double-buffered DMA scheme is used to hide the transfer-wait overhead between the levels of the memory hierarchy.

The design fully exploits the multi-core parallelism of the processor, the vector SIMD parallelism across the VPEs within a core, and the parallelism of the multiple FMAC units within a VPE. Padding was implemented in the assembly-language version, along with optimization techniques such as delay-slot filling, software pipelining, and instruction-level optimization. Finally, on the X-DSP hardware simulator, single-core single-precision convolution reaches 189.96 GFLOPS, very close to the 192 GFLOPS peak, an efficiency of 98.94%.
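The scalar-broadcast multiply-accumulate scheme of point (1) can be sketched in plain Python. This is a minimal functional model, not the X-DSP assembly: the names `conv_window_broadcast_mac`, `window`, and `kernels` are illustrative, the K accumulator lanes stand in for the vector register that holds one partial sum per convolution kernel, and data movement (SRAM/AM, DMA) is omitted.

```python
def conv_window_broadcast_mac(window, kernels):
    """Sketch of the scalar-broadcast MAC scheme (illustrative, not X-DSP code).

    window:  flat list of C*kh*kw input values (all channels of one
             convolution window, already unrolled as described above).
    kernels: list of K flat kernels, each also C*kh*kw long.
    Returns K outputs, one per kernel.
    """
    K = len(kernels)
    acc = [0.0] * K                          # vector accumulator: one lane per kernel
    for i, x in enumerate(window):
        bx = [x] * K                         # read one scalar image value, broadcast it
        w = [k[i] for k in kernels]          # vector of kernel weights at position i
        acc = [a + b * wv                    # multiply-accumulate across all lanes:
               for a, b, wv in zip(acc, bx, w)]  # one image value vs. K kernels at once
    return acc
```

One pass over the window thus produces the outputs of all K kernels for that output position, which is exactly why one scalar load can feed many kernels in parallel.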
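The pooling scheme of point (2) reduces a pooling window by repeated pairwise vector maxima. A minimal sketch, again with illustrative names and each "vector" standing for one SIMD register whose lanes carry different channels:

```python
def vector_max_pool(rows):
    """Sketch of vectorized max pooling (illustrative, not X-DSP code).

    rows: list of equal-length vectors, the values of one pooling window
          gathered lane-by-lane across multiple channels.
    Returns the elementwise maximum, i.e. one pooled value per lane.
    """
    best = rows[0]                           # first image vector
    for nxt in rows[1:]:
        best = [max(a, b)                    # lane-wise compare-select: keep the
                for a, b in zip(best, nxt)]  # larger value, then compare with the next
    return best
```

Each compare-select step handles as many channels as there are vector lanes, which is the "multiple channels in one shot" behavior described above.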
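The "ping-pong" double-buffered DMA schedule can also be modeled functionally. The sketch below is sequential Python, so it only demonstrates the buffer-swapping schedule; on the real hardware the `dma_load` of block i+1 overlaps in time with the `compute` on block i, hiding the transfer latency. All names here are hypothetical.

```python
def pingpong_process(blocks, dma_load, compute):
    """Functional model of a ping-pong double buffer (schedule only;
    real DMA transfers would run concurrently with compute).

    blocks:   sequence of data blocks to stream through on-chip memory.
    dma_load: models a DMA transfer of one block into a buffer.
    compute:  models the kernel that consumes one loaded buffer.
    """
    results = []
    buffers = [None, None]                   # the two halves of the ping-pong buffer
    buffers[0] = dma_load(blocks[0])         # prime buffer 0 before the loop
    for i in range(len(blocks)):
        if i + 1 < len(blocks):
            buffers[(i + 1) % 2] = dma_load(blocks[i + 1])  # prefetch into idle half
        results.append(compute(buffers[i % 2]))             # consume the ready half
    return results
```

For example, with `dma_load = lambda b: [x * 2 for x in b]` and `compute = sum`, processing `[[1, 2], [3, 4], [5, 6]]` yields `[6, 14, 22]`.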
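The parallelism and efficiency figures quoted above can be cross-checked with simple arithmetic. The clock frequency is not stated in the abstract; the value below is merely what the 192 GFLOPS peak would imply under the assumption that each MAC counts as two floating-point operations per cycle.

```python
# Sanity check of the figures quoted in the abstract.
vpes_per_core = 16
macs_per_vpe = 3
macs_per_cycle = vpes_per_core * macs_per_vpe      # 48 simultaneous multiply-adds
flops_per_cycle = macs_per_cycle * 2               # each MAC = 1 multiply + 1 add

peak_gflops = 192.0                                # stated single-core peak
measured_gflops = 189.96                           # stated measured performance
efficiency_pct = measured_gflops / peak_gflops * 100   # 98.94 %

# Implied clock under the assumptions above (NOT stated in the abstract):
implied_clock_ghz = peak_gflops / flops_per_cycle  # 192 / 96 = 2.0 GHz
```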
Keywords/Search Tags: Convolutional neural network, Vectorization, Convolution, Pooling, Full connection