
Research On Deep Neural Network Compression And Acceleration

Posted on: 2023-12-27 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Y Li | Full Text: PDF
GTID: 1528306902953749 | Subject: Control Science and Engineering
Abstract/Summary:
As a key technology in artificial intelligence, Deep Neural Networks (DNNs) have been widely used in many applications, such as face recognition, video processing, machine translation, search and recommendation, and biomedicine. However, as the performance of DNNs has improved, the width and depth of the networks have also grown, and models with hundreds of billions of parameters have been proposed. The high memory footprint and heavy computational load of DNNs are unaffordable for many hardware devices, especially resource-constrained mobile and wearable devices, so the compression and acceleration of DNNs has become a hot topic in academia and industry. Moreover, many studies have shown that the parameters of DNNs are excessively redundant, and that the input data in mobile inference exhibits temporal redundancy, which indicates that model compression and inference acceleration are feasible. To address parameter redundancy in DNNs and temporal locality in continuous vision on device, this thesis explores model compression and acceleration from the perspectives of network pruning and mobile inference acceleration, focusing on CNNs in computer vision. The main work and contributions are summarized as follows:

(1) An entropy-based method is proposed for network pruning. Conventional works make pruning decisions based on activation values or feature mean vectors; they fail to consider the spatial information in feature maps and therefore cannot accurately evaluate the feature-extraction ability of filters. To address this issue, this thesis proposes an entropy-based pruning method (EFP), in which entropy is introduced to measure the information in feature maps. First, a feature selection module (FS) is constructed to obtain the pruning decisions: to fully exploit the spatial information of features and avoid single-sample contingency, FS flattens each channel of a feature map into a vector and computes its average entropy weight over a large number of random inputs (a minimal sketch of this scoring follows contribution (2) below). Then, since the information distribution varies across layers, EFP sets a global entropy ratio to determine an appropriate pruning ratio for each layer. Next, the compression ceiling is raised further by iterative pruning. Finally, the effectiveness of EFP is demonstrated with advanced CNNs on several benchmark datasets; notably, for VGG-16 on CIFAR-10, EFP prunes 92.9% of the parameters and reduces FLOPs by 76% without accuracy loss.

(2) Weight-dependent gates are proposed for end-to-end network pruning. The pruning indicator, the pruning ratio, and the efficiency constraint are three main challenges in network pruning. Previous works mainly rely on hand-designed or data-driven indicators, which either involve human participation or are affected by the input data; the pruning ratio of each layer is usually specified by hand, so the differing redundancy of different layers cannot be fully considered; and for efficiency constraints, there is an inconsistency between hardware-agnostic metrics and actual efficiency. This thesis proposes a simple yet effective network pruning framework, W-Gates, that addresses all three challenges in an end-to-end manner. For the pruning indicator, weight-dependent gates are introduced to directly learn a mapping from filter weights to pruning gates (see the gate sketch below). For the efficiency constraint, a switchable Efficiency Module is constructed to impose latency or FLOPs constraints through gradients. Furthermore, to achieve a better accuracy-efficiency trade-off during pruning, an efficiency-aware loss function is defined to jointly optimize the pruning gates and the pruning ratio of each layer. W-Gates outperforms state-of-the-art methods: on ResNet34, ResNet50, and MobileNet V2, it achieves up to 1.33/1.28/1.1 higher Top-1 accuracy on ImageNet with lower hardware latency.
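To make the entropy indicator of contribution (1) concrete, here is a minimal PyTorch sketch of channel-wise entropy scoring, reconstructed from the abstract rather than taken from the thesis code; the histogram binning (`num_bins`) and the lowest-entropy selection helper are illustrative assumptions.

```python
# Minimal sketch of EFP-style entropy scoring (assumptions noted above).
import torch

def channel_entropy_scores(feature_maps: torch.Tensor, num_bins: int = 32) -> torch.Tensor:
    """feature_maps: (N, C, H, W) activations collected from many random inputs.
    Returns one average entropy score per channel (shape (C,))."""
    n, c, h, w = feature_maps.shape
    # Flatten each channel's spatial map into a vector, per sample.
    flat = feature_maps.reshape(n, c, h * w)
    scores = torch.zeros(c)
    for ch in range(c):
        values = flat[:, ch, :].reshape(-1)
        hist = torch.histc(values, bins=num_bins,
                           min=values.min().item(), max=values.max().item())
        p = hist / hist.sum().clamp(min=1)
        p = p[p > 0]                       # drop empty bins before the log
        scores[ch] = -(p * p.log()).sum()  # Shannon entropy of the activation distribution
    return scores

def low_entropy_filters(scores: torch.Tensor, prune_ratio: float) -> torch.Tensor:
    """Indices of the filters whose feature maps carry the least information."""
    k = int(prune_ratio * scores.numel())
    return torch.argsort(scores)[:k]
```

Likewise, a minimal sketch of a weight-dependent gate in the spirit of contribution (2): a small network maps each filter's weights to a binary gate trained with a straight-through estimator. The gate network size and the kept-filter penalty standing in for the Efficiency Module's latency/FLOPs constraint are assumptions, not the thesis design.

```python
# Minimal sketch of a W-Gates-style weight-dependent gate (assumptions noted above).
import torch
import torch.nn as nn

class WeightDependentGate(nn.Module):
    """Learns a mapping from a conv layer's filter weights to per-filter gates."""
    def __init__(self, filter_numel: int, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(filter_numel, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, conv_weight: torch.Tensor) -> torch.Tensor:
        # conv_weight: (C_out, C_in, k, k) -> one logit per output filter.
        logits = self.net(conv_weight.flatten(1)).squeeze(-1)
        soft = torch.sigmoid(logits)
        hard = (soft > 0.5).float()
        # Straight-through estimator: binary gates in the forward pass,
        # sigmoid gradients in the backward pass.
        return hard + soft - soft.detach()

# Usage: gate a conv layer's output channels and penalize the number of
# active filters as a crude hardware-agnostic efficiency term.
conv = nn.Conv2d(16, 32, 3, padding=1)
gate = WeightDependentGate(conv.weight[0].numel())
x = torch.randn(4, 16, 8, 8)
g = gate(conv.weight)                  # (32,) gates in {0, 1}
y = conv(x) * g.view(1, -1, 1, 1)      # zero out pruned channels
efficiency_loss = g.sum() / g.numel()  # fraction of filters kept
```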
(3) A novel semantic memory is proposed to boost mobile CNN inference. Biological experiments show that human brains can speed up visual recognition of repeatedly presented objects through faster memory encoding and access on activated neurons. For the first time, this thesis borrows such a capability and distills it into a semantic memory design, SMTM, to improve on-device CNN inference. SMTM employs a hierarchical memory architecture to leverage the temporal locality and long-tail distribution of objects of interest, and incorporates several novel techniques to put it into effect: 1) it encodes high-dimensional feature maps into low-dimensional semantic vectors for low-cost yet accurate caching and lookup; 2) it uses a novel metric to determine the exit timing, taking the inherent characteristics of different layers into account; 3) it adaptively adjusts the cache size and semantic vectors to fit scene dynamics. SMTM is prototyped on a commodity CNN engine and runs on both mobile CPUs and GPUs. Extensive experiments on large-scale datasets and models show that SMTM significantly speeds up model inference over the standard approach (up to 2x) and prior cache designs (up to 1.5x), with acceptable accuracy loss.
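To illustrate the caching idea behind contribution (3), the sketch below implements a toy semantic cache: feature maps are encoded into low-dimensional vectors by global average pooling, and inference exits early when a sufficiently similar vector is already cached. The pooling encoder, cosine-similarity threshold, and FIFO eviction are assumptions; SMTM's hierarchical memory, per-layer exit metric, and adaptive cache sizing are not modeled here.

```python
# Minimal sketch of an SMTM-style semantic cache (assumptions noted above).
import torch

class SemanticCache:
    def __init__(self, capacity: int = 64, threshold: float = 0.9):
        self.keys, self.labels = [], []  # semantic vectors and cached outputs
        self.capacity, self.threshold = capacity, threshold

    @staticmethod
    def encode(feature_map: torch.Tensor) -> torch.Tensor:
        # Encode a (C, H, W) feature map into a low-dimensional semantic
        # vector by global average pooling; L2-normalize for cosine similarity.
        v = feature_map.mean(dim=(1, 2))
        return v / v.norm().clamp(min=1e-8)

    def lookup(self, feature_map: torch.Tensor):
        """Return a cached prediction if a similar frame was seen recently."""
        if not self.keys:
            return None
        q = self.encode(feature_map)
        sims = torch.stack(self.keys) @ q        # cosine similarity to all keys
        best = int(sims.argmax())
        return self.labels[best] if sims[best] >= self.threshold else None

    def insert(self, feature_map: torch.Tensor, label):
        if len(self.keys) >= self.capacity:      # evict the oldest entry (FIFO)
            self.keys.pop(0)
            self.labels.pop(0)
        self.keys.append(self.encode(feature_map))
        self.labels.append(label)
```

On a cache hit the remaining layers are skipped entirely, which is where the temporal locality of continuous vision translates into latency savings.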
Keywords/Search Tags:Deep Neural Networks, Network Pruning, Inference Acceleration, Entropy, Weight-dependent Gates, Latency Prediction, Semantic Memory