
VLSI Optimizations And Implementations For Convolutional Neural Networks

Posted on: 2020-03-13 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Z Wang | Full Text: PDF
GTID: 2428330575955107 | Subject: Electronic and communication engineering
Abstract/Summary:
Convolutional Neural Network (CNN) based deep learning algorithms have already achieved great success in several fields, such as image classification and motion/speech recognition. However, while CNNs show impressive performance, their massive computational complexity and memory requirements prevent them from being widely deployed. As Moore's law slows down, it is no longer possible to obtain more computation through technology scaling alone, which poses a major challenge for building CNN-based embedded systems, since such systems have limited hardware resources and demand very high energy efficiency. This thesis addresses this problem by building dedicated VLSI hardware architectures for embedded systems featuring CNNs.

On the one hand, an energy-efficient hardware architecture for binary-weight CNNs (BCNNs) is proposed. BCNNs compress their parameters to only +1 and -1, reducing model sizes by an order of magnitude while still maintaining acceptable accuracy for certain applications. BCNNs therefore make it possible to achieve very high throughput with moderate power consumption, and are well suited to embedded systems that must be extremely energy efficient. The BCNN hardware architecture proposed in this thesis fully exploits the model's binary weights and other hardware-friendly characteristics. It schedules the processing flow judiciously to minimize off-chip DRAM accesses and to maximally reuse all input activations, so that memory power is reduced significantly. In addition, several micro-architectural innovations are incorporated, e.g., dedicated compressor trees and approximate computing with compensation schemes. As a result, the overall hardware complexity and critical-path delay both decrease. Experimental results on several datasets demonstrate negligible loss in model accuracy. Post-layout implementation results show that the throughput and energy efficiency of this architecture are 4.14x and 2.37x those of the prior art, respectively.

On the other hand, targeting embedded applications that require both flexibility in data quantization/computation precision and high energy efficiency, this thesis proposes another precision-adjustable sparse CNN hardware architecture, namely the Folded Precision-Adjustable Processor (FPAP). One of the most challenging aspects of designing an efficient CNN hardware architecture is eliminating the large number of unnecessary computations caused by the sparsity of both weights and activations. Another challenge is enabling the hardware to adapt to the variable data-precision requirements across layers. Unfortunately, existing works resolve only one of these two design challenges. The FPAP architecture eliminates all computational redundancies while achieving high processing speed through a globally parallel, locally serial organization based on the folding transformation. Combined with a dedicated data-encoding scheme, the dominating multiply-accumulate (MAC) operations and 1-D convolutions (FIR filters) are decomposed into multiple additions; these additions are folded onto single arithmetic units and computed serially. In this way, FPAP performs only effective computations. It adapts to variable data precisions while exploiting the sparsity of both weights and activations to further reduce computational complexity. Moreover, this thesis explores removing the fine-grained redundancies caused by add-zero operations within already reduced-precision MACs, and minimizing the overall workload by dynamically deciding whether to decompose the activation or the weight at per-MAC granularity. Furthermore, to mitigate the load-imbalance problem caused by irregular sparsity, a novel genetic-algorithm-based kernel reallocation scheme is introduced, which significantly improves throughput. Experimental results on real CNN models demonstrate that the proposed architecture achieves very high energy efficiency, ranging from 4.28 TOP/s/W to 23.63 TOP/s/W under TSMC 28 nm CMOS technology, which is more than 2x better than the prior art at data precisions above 4 bits.
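The core arithmetic benefit of binary weights described above can be illustrated with a minimal Python sketch. This is not the thesis's RTL design; it is only a hypothetical software model showing why constraining weights to +1/-1 lets a multiplier be replaced by an add/subtract (and hence by the compressor trees mentioned in the abstract). The function name `binary_weight_conv1d` is illustrative.

```python
import numpy as np

def binary_weight_conv1d(x, w_sign):
    """1-D convolution (cross-correlation) with binary +1/-1 weights.

    Because every weight is +1 or -1, each multiply collapses into an
    add or a subtract, which is why BCNN hardware can replace multiplier
    arrays with simple adder/compressor trees.
    """
    k = len(w_sign)
    out = np.empty(len(x) - k + 1)
    for i in range(len(out)):
        acc = 0.0
        for j, s in enumerate(w_sign):
            # s is +1 or -1: add or subtract the activation, no multiply
            acc += x[i + j] if s > 0 else -x[i + j]
        out[i] = acc
    return out
```

For example, sliding the kernel [+1, -1, +1] over [1, 2, 3, 4] yields [1-2+3, 2-3+4] = [2, 3], matching an ordinary convolution with the same weights but using only additions and subtractions.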
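The FPAP ideas of decomposing a MAC into serial additions and skipping ineffective work can likewise be sketched in software. The following is a simplified model under stated assumptions, not the actual FPAP datapath or its data-encoding scheme: each product is expanded bit-serially over only the nonzero bits of the weight (avoiding add-zero operations), and zero operands are skipped entirely (coarse sparsity). The function name `serial_mac` is illustrative.

```python
def serial_mac(activations, weights):
    """Sparsity-aware, bit-serial multiply-accumulate over integer pairs.

    Each product a*w is decomposed into shift-and-add steps over only
    the nonzero bits of w, and pairs with a zero operand are skipped,
    so the loop performs effective additions only.
    """
    acc = 0
    for a, w in zip(activations, weights):
        if a == 0 or w == 0:          # coarse sparsity: skip zero operands
            continue
        sign = 1
        if w < 0:                     # handle negative weights via sign
            sign, w = -1, -w
        b = 0
        while w:
            if w & 1:                 # fine-grained: only nonzero bits cost an add
                acc += sign * (a << b)
            w >>= 1
            b += 1
    return acc
```

With inputs ([3, 0, 5], [2, 7, -1]) the zero activation contributes nothing and the result equals the dense dot product 3*2 + 0*7 + 5*(-1) = 1, while the number of additions actually performed tracks only the nonzero weight bits of the surviving pairs.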
Keywords/Search Tags: Convolutional Neural Networks (CNNs), Binary Weight Convolutional Neural Networks (BCNNs), Deep Learning, VLSI Architecture, Energy-Efficient Architecture, Sparse CNNs, Folded Architecture, Precision-Adjustable Architecture