
Optimizations And Implementations For Key Components Of Deep Neural Networks

Posted on: 2021-01-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z D Qin
GTID: 1488306500966769
Subject: Electronic Science and Technology

Abstract/Summary:
Encountering the famous Von Neumann bottleneck, computing systems based on the classic computing architecture cannot satisfy the growing demand for more energy-efficient, high-performance and low-power hardware in emerging application scenarios such as artificial intelligence. The energy efficiency of processing deep neural networks (DNNs) in embedded devices is severely limited by issues such as the memory wall. The development of DNN models shows three trends: the computational and storage complexity of networks keeps growing; more types of operations are included in the models; and the weights and activations of networks exhibit varying degrees of sparsity. Targeting these algorithmic properties, this dissertation investigates optimizations of hardware architectures from the following aspects. For sparse DNN models with convolutions and matrix-vector multiplications, we aim to design a flexible architecture that exploits sparsity to improve performance. For storage-intensive fully-connected layers (FCLs), we strive to reduce memory access by compressing network parameters and optimizing the decoding scheme. For nonlinear functions, optimized implementation methods based on binary computing and stochastic computing are explored. The main results are as follows.

Firstly, a reconfigurable implementation method for DNNs based on multi-mode sparse matrix-vector multiplications (MVMs) is proposed. For networks that contain sparse convolutions and sparse MVMs, previous methods focus on fixed types of networks and lack flexibility, while other methods based on heterogeneous computing arrays cannot reuse resources. In this work, to exploit sparsity, the processing elements (PEs) are designed to handle weights in a compressed format and to operate only on non-zero activations and weights. MVMs and convolutions of varied sizes can be mapped to the PE array flexibly. To improve the parallelism and data reuse of convolutions, multi-mode strategies for array grouping and data scattering are adopted. Based on these methods, the corresponding hardware architecture is designed, in which multiple types of networks can fully reuse the computing and storage resources. According to the experimental results, compared with working directly on an uncompressed network, the throughput (GOPS) can be improved by up to 6×.

Secondly, a model compression method based on structured matrices and low-precision quantization, together with the corresponding hardware architecture, is proposed for FCLs. Owing to the large number of parameters in FCLs, their processing performance can be severely limited by memory bandwidth, even though they may not account for the largest share of computation in a model. Especially when the network scale is large, current sparsity-based methods suffer from excessive decoding overhead, which limits the improvement of energy efficiency. In this work, the weight matrix is constrained to block-circulant matrices and each weight is quantized to an integer power of two during training. In this way, a large number of multiplications can be converted into shift operations, while the block-circulant matrices can be compressed into vectors that are decoded by cyclic shifts. As a result, computational complexity and memory-access overhead are significantly reduced. The experimental results show that this method compresses the FCLs in AlexNet by 128×, and the energy efficiency of the hardware architecture reaches 5.3 TOPS/W.
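As a software-level illustration of the zero-skipping computation described in the first contribution (not the actual PE micro-architecture or compression format of the dissertation), the following Python sketch stores each row of a sparse weight matrix in a compressed index/value form and accumulates only over non-zero weight and activation pairs; all names here are illustrative assumptions.

```python
import numpy as np

def compress_rows(weights):
    """Store each row of a sparse weight matrix as (column indices, non-zero values)."""
    compressed = []
    for row in weights:
        cols = np.nonzero(row)[0]
        compressed.append((cols, row[cols]))
    return compressed

def sparse_mvm(compressed_weights, activations):
    """Matrix-vector product that skips zero weights and zero activations."""
    out = np.zeros(len(compressed_weights), dtype=activations.dtype)
    nz_act = set(np.nonzero(activations)[0])      # indices of non-zero activations
    for i, (cols, vals) in enumerate(compressed_weights):
        for c, w in zip(cols, vals):
            if c in nz_act:                       # skip contributions from zero activations
                out[i] += w * activations[c]
    return out

# Example: a 4x6 weight matrix with roughly 70% zeros
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6)) * (rng.random((4, 6)) > 0.7)
x = rng.normal(size=6) * (rng.random(6) > 0.5)
assert np.allclose(sparse_mvm(compress_rows(W), x), W @ x)
```

In hardware, the same idea means that the number of PE operations scales with the number of non-zero weight/activation pairs rather than with the dense matrix size, which is where the reported throughput gain over an uncompressed network comes from.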
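The second contribution can be modelled numerically as follows: each block of the FCL weight matrix is a circulant matrix stored as a single defining vector that is decoded by cyclic shifts, and each stored weight is rounded to a signed power of two so that the multiply in the inner product becomes a binary shift. The block size, array layout, and helper names below are assumptions for illustration, not the dissertation's implementation.

```python
import numpy as np

def quantize_pow2(w):
    """Round each weight to a signed power of two: w ~= sign(w) * 2**e."""
    sign = np.sign(w)
    e = np.round(np.log2(np.maximum(np.abs(w), 1e-12))).astype(int)
    return sign, e

def circulant_from_vector(v):
    """Decode a k x k circulant block from its defining vector by cyclic shifts."""
    return np.stack([np.roll(v, i) for i in range(len(v))])

def block_circulant_fc(signs, exps, x, k):
    """
    Forward pass of a fully-connected layer whose weight matrix is
    block-circulant (k x k blocks) with power-of-two weights.
    signs / exps: arrays of shape (p, q, k) holding the sign and binary
    exponent of the defining vector of each block; x has length q*k.
    """
    p, q, _ = signs.shape
    y = np.zeros(p * k)
    for i in range(p):
        for j in range(q):
            # Multiplication by 2**e is a binary shift; ldexp(s, e) == s * 2**e.
            block_vec = np.ldexp(signs[i, j], exps[i, j])
            C = circulant_from_vector(block_vec)
            y[i*k:(i+1)*k] += C @ x[j*k:(j+1)*k]
    return y
```

Each k×k block is stored as k sign/exponent pairs instead of k² full-precision weights, which, together with the low-precision quantization, is the source of the large compression ratios and reduced memory traffic reported for FCLs.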
Thirdly, for the sigmoid and tanh functions, we propose approximation and implementation methods based on equal-interval segmentation and Taylor series expansion. Sigmoid and tanh are typical nonlinear functions in DNNs, and they are difficult to implement directly in hardware because they contain complex exponential operations. Traditional piecewise linear approximation methods are not specially optimized for tanh and sigmoid. We therefore present novel approximation schemes based on Taylor series expansion and optimize the hardware architecture by exploiting properties such as the regularity of the function values at n·ln2. Eliminating look-up tables and multipliers, the proposed circuits use purely combinational logic to complete the calculations. Compared with traditional piecewise approximation, the critical path is reduced by 29% and the area by 52%.

Finally, low-complexity stochastic computing (SC) circuits compatible with the sigmoid and tanh functions are presented. SC circuits have the advantages of low complexity and low power. However, traditional approximation methods require complex architectures to implement nonlinear functions, which cannot leverage the advantages of SC. Based on piecewise approximation and first-order Taylor series expansion, the implementations of sigmoid and tanh are optimized according to the properties of stochastic computing. Without extra precision loss, the optimized strategies significantly reduce the complexity of the SC circuits. In addition, by exploiting the monotonicity of the functions, the method is extended to multiple types of nonlinear functions. The experimental results validate that the approximation precision for the nonlinear functions reaches the order of 1×10⁻³. Compared with traditional methods, for sigmoid and tanh the area is reduced by more than 30% and the critical path by more than 40%.
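To illustrate the third contribution, the sketch below segments the input into equal intervals of width ln2 and applies a first-order Taylor expansion around the breakpoints n·ln2, where sigmoid(n·ln2) = 1/(1 + 2⁻ⁿ) has a simple power-of-two form. The exact segmentation, expansion order, and fixed-point details of the dissertation may differ, so the accuracy of this floating-point sketch should not be read as the reported figures.

```python
import numpy as np

LN2 = np.log(2.0)

def sigmoid_taylor(x):
    """
    First-order Taylor approximation of sigmoid on equal intervals of width ln2,
    expanded around the breakpoints x0 = n*ln2, where sigmoid(n*ln2) = 1/(1 + 2**(-n)).
    """
    n = np.round(x / LN2)                 # nearest breakpoint index
    s0 = 1.0 / (1.0 + 2.0 ** (-n))        # exact value of sigmoid at n*ln2
    return s0 + s0 * (1.0 - s0) * (x - n * LN2)

def tanh_taylor(x):
    """tanh(x) = 2*sigmoid(2x) - 1, so the same approximation is reused."""
    return 2.0 * sigmoid_taylor(2.0 * x) - 1.0

x = np.linspace(-8, 8, 10001)
err_sig = np.max(np.abs(sigmoid_taylor(x) - 1.0 / (1.0 + np.exp(-x))))
err_tanh = np.max(np.abs(tanh_taylor(x) - np.tanh(x)))
print(f"max |error| sigmoid: {err_sig:.2e}, tanh: {err_tanh:.2e}")
```

The regularity at n·ln2 is what makes a multiplier-free, table-free hardware mapping plausible: the breakpoint values reduce to ratios of powers of two, and the remaining linear term involves only shifts and additions.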
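The final contribution relies on stochastic computing, in which a value in [−1, 1] is represented by the bias of a random bit-stream and arithmetic reduces to simple gates. The snippet below is only a software emulation of the basic bipolar SC primitives (encoding, XNOR multiplication, MUX scaled addition), not the dissertation's optimized sigmoid/tanh circuits; the stream length and function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1 << 14                                   # bit-stream length

def encode(value):
    """Bipolar SC encoding: value in [-1, 1] -> bit-stream with P(1) = (value+1)/2."""
    return (rng.random(N) < (value + 1.0) / 2.0).astype(np.uint8)

def decode(stream):
    """Recover the value from the fraction of ones in the stream."""
    return 2.0 * stream.mean() - 1.0

def sc_multiply(a, b):
    """In bipolar SC, a single XNOR gate multiplies two independent streams."""
    return np.logical_not(np.logical_xor(a, b)).astype(np.uint8)

def sc_scaled_add(a, b, select):
    """A 2:1 MUX with a 0.5-biased select stream computes (a + b) / 2."""
    return np.where(select, a, b).astype(np.uint8)

x, y = 0.6, -0.4
sx, sy, sel = encode(x), encode(y), encode(0.0)      # select stream has P(1) = 0.5
print("x*y     ≈", decode(sc_multiply(sx, sy)))      # close to -0.24
print("(x+y)/2 ≈", decode(sc_scaled_add(sx, sy, sel)))  # close to 0.1
```

Because each SC operation is a single gate, a piecewise first-order approximation built from such primitives stays far simpler than a conventional binary datapath, which is the property the fourth contribution exploits.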
Keywords/Search Tags:Deep neural networks, Algorithm and hardware co-design, VLSI design, Sparsity, Fully-connected layers, Nonlinear functions, Stochastic computing