In recent years, deep learning has achieved breakthroughs in areas such as autonomous driving, speech recognition, natural language processing, and medical image processing, opening a new era of intelligent Internet of Things (IoT) applications built on deep neural networks. As the accuracy of neural network models improves, their parameter sizes also grow, imposing a heavy burden on resource-constrained IoT devices and introducing excessive latency in intelligent applications. To reduce this cost, researchers have devoted considerable effort to designing lightweight deep neural network models, but existing work faces two bottlenecks: compressing model parameters causes significant accuracy degradation, and it is difficult to balance resource consumption against model performance on heterogeneous devices.

This paper systematically investigates how to design and deploy lightweight deep neural network models. Specifically, we first design a channel-changeable dynamic neural network architecture that contains multiple independent sub-networks with different parameter sizes and inference costs, enabling a runtime trade-off between resource consumption and accuracy. We then design a training method based on in-place distillation and frozen updating to improve the training quality of the dynamic model. Finally, we construct an inference strategy based on sliding updates and feature caching to stabilize the inference response time and eliminate redundant computation when switching between sub-networks.

The main contributions of this paper are as follows:

1. We design a channel-changeable dynamic neural network architecture that contains multiple sub-networks with different performance levels and can be applied to mainstream deep neural network models. The model consists of a feature extraction part and a classification part: the feature extraction part is built with a strategy of head-convolution sharing and incremental branch concatenation, and the classification part is built on scalable shared neural layers (sketched below).

2. We propose a training method based on in-place distillation and frozen updating. In-place distillation uses the largest sub-network as the teacher model to guide the remaining sub-networks, as student models, in learning its features, yielding better-performing sub-networks. The frozen updating mechanism suppresses mutual perturbation between sub-networks during back-propagation through the shallow layers of the network and improves the overall performance of the model (sketched below).

3. We construct a sliding update-based adaptive routing decision maker. In the inference preparation stage, the decision maker initializes the thresholds of all sub-networks in the dynamic model with a single round of information-entropy collection on the training set, achieving a trade-off between accuracy and inference computation. In the inference stage, it adjusts the threshold of each sub-network on the fly, based on latency feedback, through a sliding update mechanism, which stabilizes the inference time of the model under different workloads and preset latency requirements while maximizing accuracy (sketched below).

4. We adopt a feature caching-based inference mechanism. This mechanism decouples the forward propagation of the sub-networks and, by caching intermediate feature tensors, eliminates the redundant computation in the feature extraction stage of the later sub-networks, thereby speeding up inference (sketched below).

Experimental results on two public datasets show that, compared with mainstream methods, the proposed dynamic neural network architecture reduces computation by 22.4% to 25.0% and accelerates inference responses by 10.5% to 34.7% while maintaining similar accuracy.
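The following is a minimal sketch of the channel-changeable idea in contribution 1: a head convolution shared by all sub-networks, branch convolutions whose outputs are concatenated incrementally as larger sub-networks are selected, and a shared classifier whose weight matrix is sliced to match the active channel width. The class name, layer sizes, and the `level` argument are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class ChannelChangeableBlock(nn.Module):
    """Shared head convolution + incrementally concatenated branches,
    followed by a scalable shared classifier (illustrative sketch)."""

    def __init__(self, in_ch=3, head_ch=32, branch_ch=16, num_branches=3,
                 num_classes=10):
        super().__init__()
        # Head convolution shared by every sub-network.
        self.head = nn.Conv2d(in_ch, head_ch, 3, padding=1)
        # Each larger sub-network adds one more branch convolution.
        self.branches = nn.ModuleList(
            nn.Conv2d(head_ch, branch_ch, 3, padding=1)
            for _ in range(num_branches)
        )
        # Scalable shared classifier: the full weight covers the widest
        # sub-network; narrower ones use a leading slice of its columns.
        self.classifier = nn.Linear(head_ch + num_branches * branch_ch,
                                    num_classes)

    def forward(self, x, level):
        """`level` selects the sub-network: 0 uses only the shared head,
        k concatenates the outputs of the first k branches."""
        feats = [torch.relu(self.head(x))]
        for branch in self.branches[:level]:
            feats.append(torch.relu(branch(feats[0])))
        feats = torch.cat(feats, dim=1)          # incremental concatenation
        pooled = feats.mean(dim=(2, 3))          # global average pooling
        weight = self.classifier.weight[:, : pooled.shape[1]]
        return pooled @ weight.t() + self.classifier.bias

net = ChannelChangeableBlock()
x = torch.randn(2, 3, 32, 32)
small = net(x, level=0)   # cheapest sub-network
large = net(x, level=3)   # full model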
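Contribution 2 combines in-place distillation with frozen updating. The sketch below shows one training step under the illustrative interface of the previous block: the largest sub-network learns from hard labels and serves as the teacher, the smaller sub-networks match its soft predictions, and the shallow shared layers are frozen while the students back-propagate. The arguments `model(x, level)` and `shallow_layers` are assumptions for illustration.

import torch
import torch.nn.functional as F

def train_step(model, images, labels, levels, shallow_layers, optimizer):
    """One combined update: in-place distillation plus frozen updating."""
    optimizer.zero_grad()

    # 1. The largest sub-network is the teacher and learns from hard labels.
    teacher_logits = model(images, level=max(levels))
    F.cross_entropy(teacher_logits, labels).backward()
    soft_targets = teacher_logits.detach().softmax(dim=1)

    # 2. Frozen updating: the shallow shared layers stop receiving gradients,
    #    so the smaller sub-networks cannot perturb them.
    for p in shallow_layers.parameters():
        p.requires_grad_(False)

    # 3. In-place distillation: each smaller sub-network is a student that
    #    matches the teacher's soft predictions.
    for level in sorted(levels)[:-1]:
        student_logits = model(images, level=level)
        loss = F.kl_div(student_logits.log_softmax(dim=1), soft_targets,
                        reduction="batchmean")
        loss.backward()

    for p in shallow_layers.parameters():
        p.requires_grad_(True)
    optimizer.step()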
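Contribution 3 routes each input through progressively larger sub-networks and stops once the prediction is confident enough. The sketch below assumes per-sub-network entropy thresholds have already been initialized from a pass over the training set (not shown) and illustrates the sliding update that nudges all thresholds according to latency feedback; the step size, budget, and `model(x, level)` interface are illustrative assumptions.

import time
import torch

class SlidingRouter:
    """Illustrative sliding update-based adaptive routing decision maker."""

    def __init__(self, levels, thresholds, target_ms, step=0.05):
        self.levels = sorted(levels)                 # e.g. [0, 1, 2, 3]
        self.thresholds = dict(zip(self.levels, thresholds))
        self.target_ms = target_ms                   # preset latency budget
        self.step = step                             # sliding-update step size

    @staticmethod
    def entropy(logits):
        # Information entropy of the predicted class distribution.
        p = logits.softmax(dim=1)
        return float(-(p * p.clamp_min(1e-12).log()).sum(dim=1).mean())

    @torch.no_grad()
    def infer(self, model, x):
        start = time.perf_counter()
        logits = None
        for level in self.levels:
            logits = model(x, level=level)
            # Confident enough (low entropy): stop at this sub-network.
            if self.entropy(logits) <= self.thresholds[level]:
                break
        latency_ms = (time.perf_counter() - start) * 1000

        # Sliding update: raise the thresholds (exit earlier next time) after
        # overshooting the latency budget, lower them after undershooting.
        delta = self.step if latency_ms > self.target_ms else -self.step
        for level in self.levels:
            self.thresholds[level] = max(0.0, self.thresholds[level] + delta)
        return logits, latency_ms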
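Contribution 4 avoids recomputing shared features when the router escalates from one sub-network to the next on the same input. The sketch below caches the head output and the branch outputs already computed, then evaluates only the missing branches; the attribute names mirror the illustrative block above and are assumptions rather than the authors' implementation.

import torch

class CachedInference:
    """Illustrative feature cache that decouples the forward passes of the
    sub-networks by reusing already-computed intermediate tensors."""

    def __init__(self, model):
        self.model = model
        self._key = None        # identity of the cached input batch
        self._feats = []        # cached head + branch feature tensors

    @torch.no_grad()
    def __call__(self, x, level):
        if self._key is not x:                   # new input: reset the cache
            self._key = x
            self._feats = [torch.relu(self.model.head(x))]
        while len(self._feats) <= level:         # compute only missing branches
            branch = self.model.branches[len(self._feats) - 1]
            self._feats.append(torch.relu(branch(self._feats[0])))
        feats = torch.cat(self._feats[: level + 1], dim=1)
        pooled = feats.mean(dim=(2, 3))
        weight = self.model.classifier.weight[:, : pooled.shape[1]]
        return pooled @ weight.t() + self.model.classifier.bias

# When the router first tries level 0 and then escalates to level 2 on the
# same batch, only the two missing branch convolutions are evaluated.
cached = CachedInference(ChannelChangeableBlock())
batch = torch.randn(2, 3, 32, 32)
_ = cached(batch, level=0)
_ = cached(batch, level=2)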