In recent years, with the rapid development of deep learning, deep neural networks (DNNs) have achieved state-of-the-art results in a wide range of machine learning tasks, notably speech recognition, image recognition, and object detection. However, their heavy demands on memory and computational power prevent DNNs from being deployed on embedded devices with limited resources, and efficient deployment of DNNs on embedded devices has become a major focus of current deep learning research. This paper builds an embedded deep learning system covering both model compression and computing acceleration of DNNs on ARMv7 embedded devices. The main work is as follows.

First, this paper briefly introduces the basic structure of deep neural networks, analyzes the network propagation algorithm and related optimization algorithms, and describes SIMD-instruction-based computing optimization on the ARMv7 embedded platform, providing the technical basis for the subsequent research.

Second, in the study of model compression for the embedded deep learning system, to address the heavy storage requirements of DNNs, this paper presents an asymmetric ternary weight quantization method that compresses the storage of deep neural network models. During training, the method quantizes the weights of each layer to the ternary values {+α1, 0, −α2}, discretizing the weights; after training, the ternary weights are stored with a 2-bit encoding. Compared with a traditional 32-bit floating-point network, the model storage space is reduced to about 1/16, which greatly relaxes the hardware storage requirements of DNNs. Model compression experiments show that the recognition rate of the quantized network on the CIFAR-10 dataset is 0.33% higher than that of the traditional floating-point network, while the recognition rate on the ImageNet dataset is only
0.63% lower, so quantization has little impact on network accuracy.

Finally, to speed up the forward propagation of DNNs on embedded systems, which lack fast computing devices such as GPUs, the embedded deep learning system uses 8-bit fixed-point integer matrix multiplication based on NEON vector instructions. At each network layer, the traditional 32-bit floating-point weights are first converted to 8-bit fixed-point integers; the 8-bit fixed-point matrix multiplication has lower data bandwidth and higher computation speed, enabling fast computation of the layer, and the fixed-point results are finally converted back to floating-point values and passed to the subsequent layer. Experiments show that the forward computation of DNNs with 8-bit fixed-point matrix multiplication is 2-3 times faster than the traditional floating-point implementation, effectively reducing the computation time of DNNs on embedded devices.
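The asymmetric ternary quantization described above can be illustrated with a minimal sketch. The threshold rule and the way the two scales α1 and α2 are estimated here (means of the positive and negative weights above an assumed threshold fraction `delta_ratio`) are illustrative assumptions, not the thesis's exact training procedure; the 2-bit packing shows how four ternary weights fit in one byte, giving the roughly 1/16 storage of 32-bit floats.

```python
def ternarize(weights, delta_ratio=0.7):
    """Map each float weight to one of {+a1, 0, -a2}.

    delta_ratio is an assumed hyperparameter: the threshold is a
    fraction of the mean absolute weight. a1 and a2 are the means of
    the positive and negative weights beyond the threshold, so the
    two scales are asymmetric.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = delta_ratio * mean_abs
    pos = [w for w in weights if w > delta]
    neg = [w for w in weights if w < -delta]
    a1 = sum(pos) / len(pos) if pos else 0.0
    a2 = -sum(neg) / len(neg) if neg else 0.0
    # 2-bit codes: 0 -> zero, 1 -> +a1, 2 -> -a2
    codes = [1 if w > delta else 2 if w < -delta else 0 for w in weights]
    return codes, a1, a2

def pack_2bit(codes):
    """Pack four 2-bit codes into each stored byte."""
    out = bytearray()
    for i in range(0, len(codes), 4):
        b = 0
        for j, c in enumerate(codes[i:i + 4]):
            b |= c << (2 * j)
        out.append(b)
    return bytes(out)

weights = [0.9, -0.8, 0.05, -0.02, 0.7, -0.6, 0.01, 0.8]
codes, a1, a2 = ternarize(weights)
packed = pack_2bit(codes)   # 8 weights stored in 2 bytes
```

Storing these eight weights as 32-bit floats takes 32 bytes; the packed ternary form takes 2 bytes plus the two per-layer scales, which matches the roughly 16x compression reported above.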
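The 8-bit fixed-point pipeline above (quantize, multiply in integer arithmetic, restore to floating point) can be sketched in scalar Python as a stand-in for the NEON-vectorized kernel. The symmetric per-matrix scaling into [-127, 127] used here is an assumption for illustration; the real implementation vectorizes the inner integer loop with NEON instructions.

```python
def quantize_8bit(mat):
    """Map a float matrix to int8 values in [-127, 127] with one scale."""
    max_abs = max(abs(v) for row in mat for v in row) or 1.0
    scale = max_abs / 127.0
    q = [[int(round(v / scale)) for v in row] for row in mat]
    return q, scale

def qmatmul(a, b):
    """Quantize both operands, multiply with integer arithmetic,
    then rescale the integer accumulators back to floats."""
    qa, sa = quantize_8bit(a)
    qb, sb = quantize_8bit(b)
    n, k, m = len(qa), len(qb), len(qb[0])
    # Integer accumulation (32-bit accumulators in the real kernel),
    # followed by a single float rescale per output element.
    return [[sum(qa[i][t] * qb[t][j] for t in range(k)) * sa * sb
             for j in range(m)] for i in range(n)]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = qmatmul(a, b)   # close to the exact product [[19, 22], [43, 50]]
```

The result is approximate, which mirrors the accuracy trade-off reported in the experiments: the integer path trades a small rounding error per layer for lower bandwidth and faster multiplication.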