
VLSI Optimizations And Implementations For Convolutional Neural Networks

Posted on: 2020-03-13 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Z Wang | Full Text: PDF
GTID: 2428330575955107 | Subject: Electronic and communication engineering
Abstract/Summary:
Convolutional Neural Network (CNN) based deep learning algorithms have already achieved great success in several fields, such as image classification and motion/speech recognition. However, while CNNs show impressive performance, their massive computational complexity and memory requirements prevent them from being widely deployed. As Moore's law slows down, it is no longer possible to obtain more computation through technology scaling alone, which poses a major challenge for building CNN-based embedded systems, since such systems have limited hardware resources and demand very high energy efficiency. This thesis addresses this problem by building dedicated VLSI hardware architectures for embedded systems featuring CNNs.

On the one hand, an energy-efficient hardware architecture for binary-weight CNNs (BCNNs) is proposed. BCNNs compress their parameters to only +1 and -1, reducing model sizes by an order of magnitude while still maintaining acceptable accuracy for certain applications. BCNNs therefore make it possible to achieve very high throughput with moderate power consumption, and are well suited to embedded systems that must be extremely energy efficient. The BCNN hardware architecture proposed in this thesis fully exploits the model's binary weights and other hardware-friendly characteristics. It schedules the processing flow judiciously to minimize off-chip DRAM accesses and to maximally reuse all input activations, so that memory power is reduced significantly. In addition, several micro-architectural innovations are incorporated, e.g., dedicated compressor trees and approximate computing with compensation schemes. As a result, the overall hardware complexity and critical-path delay both decrease. Experimental results on several datasets demonstrate negligible loss in model accuracy. Post-layout implementation results show that the throughput and energy efficiency of this architecture are 4.14x and 2.37x those of the prior art, respectively.

On the other hand, targeting embedded applications that require both flexibility in data quantization/computation precision and high energy efficiency, this thesis proposes another precision-adjustable sparse CNN hardware architecture, namely the Folded Precision-Adjustable Processor (FPAP). One of the most challenging aspects of designing an efficient CNN hardware architecture is eliminating the large number of unnecessary computations caused by the sparsity of both weights and activations. Another challenge is enabling the hardware to adapt to the variable data-precision requirements across layers. Unfortunately, existing works resolve only one of these two design challenges. The FPAP architecture eliminates all computational redundancies while achieving high processing speed through a globally parallel, locally serial organization based on the folding transformation. Combined with a dedicated data-encoding scheme, the dominating multiply-accumulate (MAC) operations and 1-D convolutions (FIR filters) are decomposed into multiple additions; these additions are folded onto single arithmetic units and computed serially. In this way, FPAP performs only effective computations. It adapts to variable data precisions while exploiting the sparsity of both weights and activations to further reduce computational complexity. Moreover, this thesis explores removing the fine-grained redundancies caused by add-zero operations within already reduced-precision MACs, and minimizing the overall workload by dynamically deciding whether to decompose the activation or the weight at per-MAC granularity. Furthermore, to mitigate the load-imbalance problem caused by irregular sparsity, a novel genetic-algorithm-based kernel reallocation scheme is introduced, which significantly improves throughput. Experimental results on real CNN models demonstrate that the proposed architecture achieves very high energy efficiency, ranging from 4.28 TOP/s/W to 23.63 TOP/s/W under TSMC 28 nm CMOS technology, which is more than 2x better than the prior art at data precisions above 4 bits.
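The core arithmetic benefit of binary weights described above can be illustrated with a minimal Python sketch. This is not the thesis's RTL design; it is only a hypothetical software model showing why constraining weights to +1/-1 lets a multiplier be replaced by an add/subtract (and hence by the compressor trees mentioned in the abstract). The function name `binary_weight_conv1d` is illustrative.

```python
import numpy as np

def binary_weight_conv1d(x, w_sign):
    """1-D convolution (cross-correlation) with binary +1/-1 weights.

    Because every weight is +1 or -1, each multiply collapses into an
    add or a subtract, which is why BCNN hardware can replace multiplier
    arrays with simple adder/compressor trees.
    """
    k = len(w_sign)
    out = np.empty(len(x) - k + 1)
    for i in range(len(out)):
        acc = 0.0
        for j, s in enumerate(w_sign):
            # s is +1 or -1: add or subtract the activation, no multiply
            acc += x[i + j] if s > 0 else -x[i + j]
        out[i] = acc
    return out
```

For example, sliding the kernel [+1, -1, +1] over [1, 2, 3, 4] yields [1-2+3, 2-3+4] = [2, 3], matching an ordinary convolution with the same weights but using only additions and subtractions.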
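The FPAP ideas of decomposing a MAC into serial additions and skipping ineffective work can likewise be sketched in software. The following is a simplified model under stated assumptions, not the actual FPAP datapath or its data-encoding scheme: each product is expanded bit-serially over only the nonzero bits of the weight (avoiding add-zero operations), and zero operands are skipped entirely (coarse sparsity). The function name `serial_mac` is illustrative.

```python
def serial_mac(activations, weights):
    """Sparsity-aware, bit-serial multiply-accumulate over integer pairs.

    Each product a*w is decomposed into shift-and-add steps over only
    the nonzero bits of w, and pairs with a zero operand are skipped,
    so the loop performs effective additions only.
    """
    acc = 0
    for a, w in zip(activations, weights):
        if a == 0 or w == 0:          # coarse sparsity: skip zero operands
            continue
        sign = 1
        if w < 0:                     # handle negative weights via sign
            sign, w = -1, -w
        b = 0
        while w:
            if w & 1:                 # fine-grained: only nonzero bits cost an add
                acc += sign * (a << b)
            w >>= 1
            b += 1
    return acc
```

With inputs ([3, 0, 5], [2, 7, -1]) the zero activation contributes nothing and the result equals the dense dot product 3*2 + 0*7 + 5*(-1) = 1, while the number of additions actually performed tracks only the nonzero weight bits of the surviving pairs.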
Keywords/Search Tags: Convolutional Neural Networks (CNNs), Binary Weight Convolutional Neural Networks (BCNNs), Deep Learning, VLSI Architecture, Energy-Efficient Architecture, Sparse CNNs, Folded Architecture, Precision-Adjustable Architecture