Benefiting from the self-attention module, the Transformer architecture has shown remarkable performance in many computer vision tasks. Compared with mainstream convolutional neural networks, vision transformers usually rely on a more complex structure to extract strong feature representations. Although this improves accuracy, it also demands more computing resources and makes deployment on mobile devices more difficult. This study investigates the feasibility of applying model compression to vision transformers, taking the three major model compression techniques as the starting point: network pruning, parameter quantization, and knowledge distillation. Pruning simplifies the structure of the original network; quantization represents high-precision inputs and weight parameters at low precision; and knowledge distillation transfers the useful information in a large high-precision network to a lightweight network. With these methods, the vision transformer is compressed for deployment on hardware platforms such as FPGAs. The main work of this thesis includes the following three aspects:

1. A vision transformer pruning method is proposed. The method identifies the importance of the dimensions in each transformer layer and prunes them accordingly. The important dimensions emerge automatically when sparsity is promoted over the transformer dimensions. To obtain a higher pruning ratio, a large number of dimensions with small importance scores are pruned without significantly reducing accuracy. The pipeline for vision transformer pruning is as follows: 1) training with sparsity regularization; 2) pruning the feature dimensions according to a predefined pruning ratio; 3) fine-tuning. The parameters and floating-point operations (FLOPs) of the pruned models are evaluated and analyzed on the ImageNet dataset, which verifies the effectiveness of the method. A minimal code sketch of this pipeline is given after the abstract.

2. An effective post-training quantization algorithm for vision transformers is proposed. The quantization task is formulated as finding the optimal low-bit quantization intervals for weights and inputs, respectively. To preserve the function of the attention mechanism, a ranking loss is introduced into the conventional quantization objective, aiming to keep the relative order of the self-attention results after quantization. The relationship between the quantization loss of different layers and feature diversity is analyzed in depth, and a mixed-precision quantization scheme is explored using the nuclear norm of each attention map and output feature. The effectiveness of the proposed method is verified on several benchmark models. For example, with 8-bit mixed-precision quantization, the DeiT-Base model achieves 81.29% Top-1 accuracy on the ImageNet dataset, outperforming state-of-the-art post-training quantization algorithms. A sketch of the interval search with ranking loss also follows the abstract.

3. A vision transformer knowledge distillation method based on image patch-level manifolds is explored. The method simultaneously computes the patch-level manifold spaces of the teacher and student models over intra-image, inter-image, and randomly sampled patch relations, and mines useful information from the teacher transformer through the relationship between images and their patches. Experimental results on several baselines demonstrate the superiority of the proposed algorithm in producing portable transformer models with higher performance. For example, the distilled DeiT-Tiny model achieves 75.06% Top-1 accuracy on the ImageNet dataset, which is better than existing vision transformer distillation methods. A sketch of the patch-level manifold matching loss follows the abstract as well.
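The following is a minimal sketch of the pruning pipeline from aspect 1, assuming a learnable gate vector over the feature dimension of each transformer layer. The names (DimensionGate, sparsity_loss, prune_dimensions) and the hyperparameter values are illustrative assumptions, not the thesis implementation.

```python
# Sketch of dimension pruning: 1) train with a sparsity penalty on per-dimension
# gates, 2) prune low-importance dimensions by a predefined ratio, 3) fine-tune.
import torch
import torch.nn as nn

class DimensionGate(nn.Module):
    """Learnable soft gate over the feature dimension of one transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(dim))

    def forward(self, x):            # x: (batch, tokens, dim)
        return x * self.gate         # scale each dimension by its learned importance

def sparsity_loss(gates, weight=1e-4):
    """L1 regularization pushing unimportant dimensions toward zero."""
    return weight * sum(g.gate.abs().sum() for g in gates)

@torch.no_grad()
def prune_dimensions(gate, pruning_ratio=0.4):
    """Keep the (1 - pruning_ratio) fraction of dimensions with the largest |gate|."""
    scores = gate.gate.abs()
    num_keep = int(scores.numel() * (1.0 - pruning_ratio))
    keep_idx = scores.topk(num_keep).indices.sort().values
    return keep_idx
```

In practice the kept indices returned by prune_dimensions would be used to slice the corresponding rows and columns of the attention and MLP weight matrices before the fine-tuning stage.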
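The sketch below illustrates the post-training quantization idea from aspect 2: a search over candidate quantization intervals that balances reconstruction error against a ranking loss preserving the order of self-attention values. The uniform symmetric quantizer, the sampled-pair ranking loss, the grid size, and the loss weight are assumptions made for illustration only.

```python
# Sketch of interval search with a ranking loss on self-attention outputs.
import torch

def quantize(x, scale, bits=8):
    """Uniform symmetric quantization of x with a given interval (scale)."""
    qmax = 2 ** (bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

def ranking_loss(attn_q, attn_fp, margin=0.0):
    """Hinge penalty whenever quantization flips the order of two attention values."""
    a, b = attn_fp.flatten(), attn_q.flatten()
    i = torch.randint(0, a.numel(), (1024,))
    j = torch.randint(0, a.numel(), (1024,))
    sign = torch.sign(a[i] - a[j])
    return torch.relu(margin - sign * (b[i] - b[j])).mean()

def search_interval(attn_fp, bits=8, weight=0.1, num_candidates=50):
    """Grid-search the scale that trades off MSE against attention-order preservation."""
    best_scale, best_loss = None, float("inf")
    max_val = attn_fp.abs().max()
    for k in range(1, num_candidates + 1):
        scale = max_val * k / num_candidates / (2 ** (bits - 1) - 1)
        attn_q = quantize(attn_fp, scale, bits)
        loss = torch.mean((attn_q - attn_fp) ** 2) + weight * ranking_loss(attn_q, attn_fp)
        if loss.item() < best_loss:
            best_scale, best_loss = scale, loss.item()
    return best_scale
```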
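Finally, a sketch of the patch-level manifold matching loss from aspect 3, assuming the manifold is represented by pairwise cosine similarities of patch features and decomposed into intra-image, inter-image, and randomly sampled terms; the equal term weighting and the sample count are illustrative assumptions.

```python
# Sketch of patch-level manifold distillation: match the pairwise patch-similarity
# structure of student and teacher features across three views.
import torch
import torch.nn.functional as F

def manifold(feat):
    """Pairwise cosine-similarity (manifold) matrix of patch features."""
    feat = F.normalize(feat, dim=-1)
    return feat @ feat.transpose(-2, -1)

def manifold_distill_loss(student, teacher, num_samples=192):
    """student/teacher: (batch, patches, dim) patch features from one layer."""
    B, P, _ = student.shape
    # Intra-image: relations between patches within each image.
    intra = F.mse_loss(manifold(student), manifold(teacher))
    # Inter-image: relations of the same patch position across images.
    inter = F.mse_loss(manifold(student.transpose(0, 1)),
                       manifold(teacher.transpose(0, 1)))
    # Random sampling: relations over patches sampled across the whole batch.
    idx_b = torch.randint(0, B, (num_samples,))
    idx_p = torch.randint(0, P, (num_samples,))
    rand = F.mse_loss(manifold(student[idx_b, idx_p].unsqueeze(0)),
                      manifold(teacher[idx_b, idx_p].unsqueeze(0)))
    return intra + inter + rand
```

Because only similarity matrices are matched, the student and teacher feature dimensions may differ, which is what allows a compact student such as DeiT-Tiny to learn from a larger teacher.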