The vision transformer is a deep learning model that has recently achieved significant breakthroughs in image classification and has quickly expanded to other computer vision tasks such as object detection, semantic segmentation, and image generation. However, its performance gains depend on large amounts of training data, a requirement that many real-world visual tasks cannot meet. This situation is particularly pronounced in scientific and medical fields, where large image datasets are difficult to obtain. In addition, because the content of such domain data differs greatly from natural images, existing weights pre-trained on the ImageNet dataset are difficult to transfer to these domains. Researching data-efficient vision transformers that reduce these data requirements is therefore of great significance for the practical application of transformers. Current approaches to the data-efficiency problem focus, on one hand, on changing model parameters and training strategies to improve training efficiency and generalization, and, on the other hand, on introducing the strong inductive biases of convolutional neural networks into transformers to improve learning efficiency. These methods improve the performance of vision transformers on small image datasets to some extent, but they still fall short of advanced convolutional neural network methods.

This thesis first analyzes the attention distance distribution of the vision transformer's self-attention heads and studies how it behaves on datasets of different scales. Compared with models trained on sufficient data, vision transformers trained on small datasets were found to lack self-attention heads with local-range attention. In addition, unlike recent vision transformer methods based on local window attention, the original vision transformer retains global-range self-attention heads regardless of which dataset it is trained on. Based on these two observations, this thesis proposes a self-attention method that appropriately suppresses long-distance attention, called Multi-scale Focal Attention. Experimental results show that this method improves accuracy by 12% over the baseline ViT on the small CIFAR image dataset, while incurring less than 1% performance loss on the medium-to-large ImageNet dataset.

To address the weaker performance of Multi-scale Focal Attention on medium-to-large datasets, this thesis further studies how attention scale affects training loss and accuracy on the CIFAR dataset. Experiments show that short-distance attention models achieve higher accuracy, whereas long-distance attention models achieve lower training loss. This partly explains why ViT with global attention overfits when training data are insufficient, and also shows that ViT with global attention has higher fitting capacity. Accordingly, this thesis proposes a Swelling Focal Attention method that builds on the previous work. This method achieves optimal classification accuracy on the small datasets CIFAR-10 and CIFAR-100 (98.32% and 83.20%, respectively) while achieving performance comparable to existing ViTs on medium-to-large datasets such as ImageNet.

In summary, this thesis addresses the low data efficiency of vision transformers from the perspective of attention distance. The proposed Swelling Focal Attention enables vision transformers to be trained effectively on datasets of any scale.
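The attention-distance analysis referred to above can be reproduced in outline as follows. This is a minimal sketch, not the thesis code: it computes, for each head, the average spatial distance between query and key patch positions weighted by the attention probabilities, which is the standard notion of mean attention distance for a ViT on a square patch grid. The tensor layout and the names `attn` and `grid_size` are assumptions made for illustration.

```python
import torch

def mean_attention_distance(attn: torch.Tensor, grid_size: int) -> torch.Tensor:
    """Mean attention distance per head, measured in patch units.

    attn: attention probabilities of shape (batch, heads, N, N),
          where N = grid_size * grid_size (class token removed).
    Returns a tensor of shape (heads,).
    """
    # 2-D coordinates of every patch on the grid.
    ys, xs = torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
    )
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)

    # Pairwise Euclidean distances between query and key patch positions.
    dist = torch.cdist(coords, coords)  # (N, N)

    # Attention-weighted distance, averaged over queries and the batch.
    weighted = (attn * dist).sum(dim=-1)  # (batch, heads, N)
    return weighted.mean(dim=(0, 2))      # (heads,)
```

Plotting this quantity per head and per layer is what reveals whether a model has retained heads with local-range attention.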
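The general idea of suppressing long-distance attention can be illustrated by restricting each head to a focal radius on the patch grid. The thesis's actual Multi-scale Focal Attention assigns attention ranges across heads in its own way; the per-head radius masking below is only an assumed example of the underlying mechanism, and the names `coords` and `radius_per_head` are hypothetical.

```python
import torch
import torch.nn.functional as F

def focal_masked_attention(q, k, v, coords, radius_per_head):
    """Scaled dot-product attention with a per-head focal radius (illustrative).

    q, k, v:          (batch, heads, N, dim_head)
    coords:           (N, 2) patch coordinates on the grid
    radius_per_head:  (heads,) maximum attention distance allowed per head
    """
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale  # (batch, heads, N, N)

    # Keys farther than a head's focal radius are masked out of its attention.
    dist = torch.cdist(coords, coords)  # (N, N)
    mask = dist[None, None] > radius_per_head[None, :, None, None]
    logits = logits.masked_fill(mask, float("-inf"))

    return F.softmax(logits, dim=-1) @ v
```

Giving different heads different radii yields a multi-scale mixture of short- and long-range attention, which is the spirit of the method described above.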