| In recent years, the continuous development of deep neural networks (DNNs) has enabled artificial intelligence applications to excel in fields such as autonomous driving and smart homes. However, powerful DNNs usually come with a large number of parameters, which prevents them from being deployed effectively and efficiently on resource-constrained devices. Reducing the model size and computational cost of DNNs while maintaining their performance has therefore become an urgent challenge. Model pruning aims to safely remove unimportant connections in a neural network at a small cost in accuracy, and it is widely used to compress and accelerate convolutional neural networks (CNNs). Conventional pruning techniques consider only the differing accuracy sensitivity of layers and ignore their differing latency sensitivity when determining layer sparsity. As a result, an expensive pruning-and-selection exploration process is needed to find a high-accuracy, low-latency model. Moreover, prior work on filter pruning uses static characteristics of the network to determine filter importance and guide pruning, which can lead to inaccurate filter selection and serious accuracy loss. To solve these problems, this paper proposes a latency-aware automated model pruning technique with two main components. (1) A latency-aware automated framework leverages reinforcement learning to determine layer sparsity automatically. Latency sensitivity is introduced as prior knowledge and incorporated into the exploration loop. Rather than relying on a single reward signal such as validation accuracy or floating-point operations (FLOPs), the agent receives feedback on both accuracy error and latency sensitivity, so that substructures with better accuracy and latency can be found. (2) A novel intra-layer filter pruning algorithm accurately identifies the important filters within a layer based on their dynamic changes. The underlying principle is that more active filters adapt better to the incomplete network and can compensate for the representation capability of the pruned filters. The algorithm also includes a newly proposed filter regeneration strategy. Together, these enable more precise intra-layer filter pruning. Compared with state-of-the-art handcrafted and automated compression policies, the technique demonstrates superior performance for VGGNet, ResNet, and MobileNet on the CIFAR-10, ImageNet, and Food-101 datasets. It speeds up MobileNet-V1 inference by approximately 1.64× on a Titan RTX GPU with no loss of ImageNet Top-1 accuracy, and it significantly improves the Pareto-optimal curve of the accuracy-latency trade-off. |
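
To make the idea of a reward that reflects both accuracy error and latency more concrete, the following is a minimal sketch, not the formulation used in the paper. The function name `latency_aware_reward`, the linear combination, and the `latency_weight` parameter are all assumptions introduced here for illustration; the abstract does not specify how the two signals are combined.

```python
def latency_aware_reward(accuracy, latency_ms, baseline_latency_ms, latency_weight=0.5):
    """Hypothetical reward for a pruning agent (illustrative only).

    accuracy:            validation accuracy of the pruned model, in [0, 1]
    latency_ms:          measured inference latency of the pruned model
    baseline_latency_ms: measured latency of the unpruned model
    latency_weight:      assumed trade-off coefficient (not from the paper)
    """
    if baseline_latency_ms <= 0:
        raise ValueError("baseline_latency_ms must be positive")
    error = 1.0 - accuracy                           # accuracy error term
    latency_ratio = latency_ms / baseline_latency_ms  # 1.0 means no speedup
    # Higher reward means lower error and lower relative latency.
    return -(error + latency_weight * latency_ratio)


def pick_best(candidates, baseline_latency_ms):
    """Return the (accuracy, latency_ms) pair with the highest reward."""
    return max(
        candidates,
        key=lambda c: latency_aware_reward(c[0], c[1], baseline_latency_ms),
    )
```

With such a reward, two pruned candidates of equal accuracy are ranked by measured latency rather than by a proxy such as FLOPs, which is the behavior the abstract attributes to the latency-aware framework.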