
Optimization and Acceleration of Convolutional Neural Network

Posted on: 2020-07-14
Degree: Master
Type: Thesis
Country: China
Candidate: J C Wang
GTID: 2518305732998859
Subject: Microelectronics and Solid State Electronics

Abstract/Summary:
Convolutional neural networks (CNNs) have become one of the most popular deep learning algorithms thanks to their remarkable performance in image, speech, and text applications. Real-time CNN implementations in resource-limited embedded systems are highly desired despite the inherent computational complexity, and CNN-based applications such as image generation and speech recognition have great prospects on mobile devices; this is the trend of CNN deployment. All of it depends on high-accuracy, low-power CNN implementation techniques.

In the early development of deep learning, both efficient training and fast inference were carried out on dedicated graphics processing units (GPUs). GPUs, however, do not reduce the intrinsic computational complexity. With the rapid development of CNNs, diverse models keep emerging: from early baseline networks such as AlexNet [1], VGGNet [2], and ResNet [3], to high-accuracy networks such as DenseNet [4] and FractalNet [5], to lightweight networks such as MobileNet (V1, V2) [6,7] and SqueezeNet [8], to today's well-known generative adversarial networks (GANs) [9]. Although these models differ greatly in structure, their basic operations are similar: convolutions account for more than 90% of the computation in a CNN implementation, demand large amounts of on-chip storage, and consume considerable power. We therefore focus on reducing CNN computational complexity, reducing on-chip storage, and optimizing external memory bandwidth.

Recent GAN models also include deconvolution, the inverse (transposed) counterpart of convolution. Implementing deconvolution with traditional convolution approaches causes redundant computation (many multiplications involve inserted zeros) and memory overhead, so deconvolution is a further target of our optimization.

Our design rests on three techniques. First, based on the parallel fast finite impulse response (FIR) algorithm (FFA), we design an efficient implementation of standard convolution. Second, using an equivalent deconvolution-to-convolution transform, we map deconvolution onto the regular convolution accelerator. Third, a layer-fusion and resource-partition scheme considerably reduces the required on-chip resources, uses external memory bandwidth efficiently, and resolves the bandwidth-imbalance problem.

In this paper, we first derive the 3-parallel and 5-parallel FFAs theoretically and, based on them, design 3-parallel and 5-parallel fast convolution units (FCUs), which reduce the multiplications of 3 × 3 and 5 × 5 convolutions by 30% and 40%, respectively. A reconfigurable FCU is designed to further save computational resources. We implement our designs on a Xilinx FPGA platform and outperform comparable works by 2x in resource utilization. The proposed storage architecture saves 14x memory resources compared with traditional approaches and keeps all intermediate results on chip. The demonstration design achieves 33 fps on 224 × 224 image classification, 3x the rate of comparable works. The proposed bandwidth-efficient architecture applies resource partitioning and computation pipelining, which substantially increases system throughput and reduces bandwidth by 2x compared with similar works.
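The FFA saving can be checked numerically. The sketch below is not the thesis's FPGA design but a minimal Python illustration (the function name ffa3_conv1d and the block-state handling are my own): a 3-parallel FFA for a length-3 FIR filter computes each block of three outputs with six multiplications instead of the nine needed by the direct form, matching the roughly 30% saving claimed for 3 × 3 convolutions when applied per kernel row.

```python
import numpy as np

def ffa3_conv1d(x, h):
    """3-parallel fast FIR: 6 multiplications per block of 3 outputs,
    versus 9 for the direct form of a length-3 filter."""
    h0, h1, h2 = h
    n = len(x) - len(x) % 3          # process whole blocks of 3 samples
    y = np.zeros(n)
    d12 = d11 = d22 = 0.0            # products delayed by one block (z^-3)
    for k in range(0, n, 3):
        x0, x1, x2 = x[k], x[k + 1], x[k + 2]
        p00 = h0 * x0                # the six multiplications
        p11 = h1 * x1
        p22 = h2 * x2
        p01 = (h0 + h1) * (x0 + x1)
        p12 = (h1 + h2) * (x1 + x2)
        p02 = (h0 + h2) * (x0 + x2)
        y[k]     = p00 + (d12 - d11 - d22)  # y(3k)   = h0*x0 + delayed cross terms
        y[k + 1] = p01 - p00 - p11 + d22    # y(3k+1) = h0*x1 + h1*x0 + delayed h2*x2
        y[k + 2] = p02 - p00 - p22 + p11    # y(3k+2) = h0*x2 + h1*x1 + h2*x0
        d12, d11, d22 = p12, p11, p22
    return y

x = np.random.randn(12)
h = np.array([0.5, -1.0, 2.0])
assert np.allclose(ffa3_conv1d(x, h), np.convolve(x, h)[:12])
```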
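The deconvolution-to-convolution equivalence the accelerator relies on can be sketched the same way. Assuming 1-D single-channel data for brevity (the thesis's transform works on 2-D feature maps and, on hardware, additionally splits the kernel so that multiplications by inserted zeros are skipped), a stride-s transposed convolution equals zero insertion followed by an ordinary convolution:

```python
import numpy as np

def deconv1d(x, w, stride):
    """Direct transposed convolution: each input sample scatters a
    scaled copy of the kernel into the output."""
    y = np.zeros((len(x) - 1) * stride + len(w))
    for k, xk in enumerate(x):
        y[k * stride : k * stride + len(w)] += xk * w
    return y

def deconv1d_as_conv(x, w, stride):
    """Same result via zero insertion + ordinary convolution, so a
    regular convolution accelerator can be reused; the inserted zeros
    are the redundant work that kernel splitting removes."""
    x_up = np.zeros((len(x) - 1) * stride + 1)
    x_up[::stride] = x               # place samples stride apart
    return np.convolve(x_up, w)      # full linear convolution

x, w = np.random.randn(5), np.random.randn(4)
assert np.allclose(deconv1d(x, w, 2), deconv1d_as_conv(x, w, 2))
```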
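Layer fusion is the remaining architectural idea: compute a stripe of one layer's output and consume it in the next layer while it is still on chip, so the full intermediate feature map never travels to external memory. A minimal single-channel Python sketch, with hypothetical names (conv3x3, fused_two_layers) and an arbitrary stripe height of 8 rows:

```python
import numpy as np

def conv3x3(x, w):
    """Valid (no-padding) 3x3 convolution on one 2-D channel."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * w)
    return out

def fused_two_layers(image, w1, w2, tile_rows=8):
    """conv -> conv computed one horizontal stripe at a time: each
    stripe of the layer-1 output feeds layer 2 immediately."""
    H, W = image.shape
    out_rows = H - 4                 # two valid 3x3 convs shrink height by 4
    out = np.zeros((out_rows, W - 4))
    for r0 in range(0, out_rows, tile_rows):
        r1 = min(r0 + tile_rows, out_rows)
        stripe = image[r0 : r1 + 4, :]   # stripe plus 4-row halo
        mid = conv3x3(stripe, w1)        # intermediate stays "on chip"
        out[r0:r1, :] = conv3x3(mid, w2)
    return out

img = np.random.randn(16, 16)
w1, w2 = np.random.randn(3, 3), np.random.randn(3, 3)
assert np.allclose(fused_two_layers(img, w1, w2),
                   conv3x3(conv3x3(img, w1), w2))
```

Only the stripe and its halo are buffered at any time, which is where the on-chip memory saving and the bandwidth reduction reported above come from.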
Keywords/Search Tags: Convolutional Neural Network (CNN), convolution and deconvolution, optimization and acceleration, bandwidth and storage optimization, FPGA