
Design And Implementation Of Three-Dimensional Array Smart Chip Architecture For Video Processing Applications

Posted on: 2022-08-24    Degree: Master    Type: Thesis
Country: China    Candidate: M Lei    Full Text: PDF
GTID: 2518306605470034    Subject: Master of Engineering
Abstract/Summary:
With the spread of the Internet, Big Data, and Artificial Intelligence, Deep Learning has achieved remarkable results in computer vision, autonomous decision-making by intelligent agents, and natural language processing, owing to its powerful feature-extraction ability. The rapid development of Deep Learning has made its acceleration methods and architecture design a hot topic in both academia and industry. Today's mainstream Deep Learning accelerators speed up the intelligent processing of single-frame images effectively. For video applications, however, directly applying single-frame acceleration techniques wastes a great deal of hardware resources and causes many repeated read and write operations on off-chip memory.

Both kinds of input data to a Deep Convolutional Neural Network, input feature maps and weights, support data reuse. In input-feature-map reuse, multiple convolution kernels act on the same input feature map. In convolution-kernel reuse, data in different sliding windows of the input feature map share the same weights, and when multiple input feature maps are processed at once (called batch processing), the same convolution kernel is applied to different input feature maps. These three kinds of data reuse exploit the three kinds of parallelism in convolutional-layer computation: parallelism across output channels, parallelism across sliding windows, and parallelism across input feature maps.

The existing Multi-Weight and Multi-Thread technique based on data reuse can effectively accelerate single-frame image processing, but it has several shortcomings. When input data are read from memory and sent to the computing-unit array, the bit width of the data read is generally far smaller than the memory bandwidth, so bandwidth utilization is extremely low. The technique exploits only the parallelism across sliding windows and output channels in convolutional-layer computation, not the parallelism across input feature maps. And when a fully connected layer is processed, the computation for a single-frame image occupies only one thread, so the hardware resources of the remaining threads are wasted.

In this thesis, the data-reuse-based Multi-Weight and Multi-Thread technique is extended to batch processing of multiple frames. A multi-frame image preprocessing module is designed: the data at corresponding pixel positions of consecutive frames are spliced so that the bit width of the input data approaches the memory bandwidth, and the number of images preprocessed in each pass is chosen according to that bandwidth. Exploiting the three kinds of parallelism in convolutional-layer computation, a three-dimensional computing-array architecture is designed to operate on the data produced by the preprocessing module. A computation-scheduling module is also designed; it applies different buffering, computation, output, and storage strategies to the multi-frame data according to the characteristics of each network layer: computing-unit multiplexing in the convolutional layers, time-division multiplexing in the pooling layers and the convolutional layers after the first, and space-division multiplexing in the fully connected layers. A minimal software sketch of the splicing idea follows. This multi-frame batch-processing
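As an illustration of the splicing step described above, the Python sketch below packs the pixels at the same position of several consecutive frames into one wide memory word. This is a software analogy only, not the thesis's hardware design; the bus width, pixel width, and frame count per word are assumptions chosen so that one packed word fills the assumed memory bus.

```python
import numpy as np

# Minimal sketch of the multi-frame preprocessing idea (illustrative only).
# Pixels at the same (row, col) position of N consecutive frames are spliced
# into one wide word so each memory access is close to the full bus width.

MEM_BUS_BITS = 64                                # assumed off-chip memory bus width
PIXEL_BITS = 8                                   # assumed pixel width (8-bit activations)
FRAMES_PER_WORD = MEM_BUS_BITS // PIXEL_BITS     # frames spliced per word (8 here)

def splice_frames(frames):
    """Pack corresponding pixels of `frames` (a list of HxW uint8 arrays)
    into one array of wide words, one word per pixel position."""
    assert len(frames) == FRAMES_PER_WORD
    stacked = np.stack(frames, axis=-1)          # shape (H, W, FRAMES_PER_WORD)
    words = np.zeros(stacked.shape[:2], dtype=np.uint64)
    for i in range(FRAMES_PER_WORD):
        # frame i occupies bit field [i*8, i*8+8) of each 64-bit word
        words |= stacked[..., i].astype(np.uint64) << np.uint64(i * PIXEL_BITS)
    return words

# Usage: 8 consecutive 28x28 frames -> one 28x28 array of 64-bit words,
# so each memory transaction carries a pixel from all 8 frames at once.
frames = [np.random.randint(0, 256, (28, 28), dtype=np.uint8) for _ in range(8)]
packed = splice_frames(frames)
print(packed.shape, packed.dtype)                # (28, 28) uint64
```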
technique not only improves memory-bandwidth utilization and reduces the number of times the input feature data and weight data must be read from memory, it also reduces the access power of the off-chip memory. When the fully connected layers are computed, the space-division-multiplexing strategy lets multiple threads process the data of different frames in parallel, reducing the waste of hardware resources.

The correctness of the design is verified by deploying the LeNet-5 and AlexNet convolutional neural networks on the accelerator. For the small network LeNet-5, QuestaSim simulation and FPGA testing are used; 100 handwritten-digit pictures from the MNIST data set serve as experimental samples, and the classification accuracy achieved with the multi-frame batch-processing technique reaches 99%. For the large network AlexNet, FPGA testing is combined with logic-analyzer debugging in Vivado. Compared with single-frame image processing, this technique triples the overall throughput of the system.
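The sketch below gives a software analogy of the space-division multiplexing used in the fully connected layers: with a batch of frames, each hardware thread of the array is assigned a different frame, so no thread sits idle as it would in single-frame processing. The thread count, layer sizes, and data types are assumptions for illustration, not values from the thesis.

```python
import numpy as np

# Illustrative analogy of space-division multiplexing in the fully connected
# layer (not the thesis's RTL). With a single frame, only one "thread" of the
# compute array does useful FC work; with a batch, thread t handles frame t.

NUM_THREADS = 8              # assumed number of hardware threads in the array
IN_DIM, OUT_DIM = 400, 120   # assumed FC layer size (LeNet-5-like)

def fc_batch(weights, biases, frame_features):
    """Each row of `frame_features` is one frame's flattened feature vector;
    conceptually, thread t computes the output for frame t in parallel."""
    outputs = np.empty((frame_features.shape[0], OUT_DIM), dtype=np.float32)
    for t, x in enumerate(frame_features):   # one iteration per "thread"/frame
        outputs[t] = weights @ x + biases
    return outputs

W = np.random.randn(OUT_DIM, IN_DIM).astype(np.float32)
b = np.random.randn(OUT_DIM).astype(np.float32)
batch = np.random.randn(NUM_THREADS, IN_DIM).astype(np.float32)

print(fc_batch(W, b, batch).shape)           # (8, 120): all 8 threads produce output
```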
Keywords/Search Tags:Rearrange, Data Reuse, Multi-Weight and Multi-Thread, Batch Processing, FPGA Verification