The vision transformer is a deep learning model that has recently achieved significant breakthroughs in image classification and has quickly expanded to other computer vision tasks such as object detection, semantic segmentation, and image generation. However, its performance gains depend on large amounts of training data, a requirement that many real-world visual tasks cannot meet. This situation is particularly pronounced in scientific and medical fields, where large image datasets are difficult to obtain. In addition, because the content of such domain data differs greatly from natural images, existing weights pre-trained on the ImageNet dataset are difficult to transfer to these domains. Researching data-efficient vision transformers that reduce these data requirements is therefore of great significance for the practical application of transformers. Current approaches to the data-efficiency problem focus, on one hand, on changing model parameters and training strategies to improve training efficiency and generalization, and, on the other hand, on introducing the strong inductive biases of convolutional neural networks into transformers to improve learning efficiency. These methods improve the performance of vision transformers on small image datasets to some extent, but they still fall short of advanced convolutional neural network methods.

This thesis first analyzes the attention distance distribution of the vision transformer's self-attention heads and studies how it behaves on datasets of different scales. Compared with models trained on sufficient data, vision transformers trained on small datasets were found to lack self-attention heads with local-range attention. In addition, unlike recent vision transformer methods based on local window attention, the original vision transformer retains global-range self-attention heads regardless of which dataset it is trained on. Based on these two observations, this thesis proposes a self-attention method that appropriately suppresses long-distance attention, called Multi-scale Focal Attention. Experimental results show that this method improves accuracy by 12% over the baseline ViT on the small CIFAR image dataset, while incurring less than 1% performance loss on the medium-to-large ImageNet dataset.

To address the weaker performance of Multi-scale Focal Attention on medium-to-large datasets, this thesis further studies how attention scale affects training loss and accuracy on the CIFAR dataset. Experiments show that short-distance attention models achieve higher accuracy, whereas long-distance attention models achieve lower training loss. This partly explains why ViT with global attention overfits when training data are insufficient, and also shows that ViT with global attention has higher fitting capacity. Accordingly, this thesis proposes a Swelling Focal Attention method that builds on the previous work. This method achieves optimal classification accuracy on the small datasets CIFAR-10 and CIFAR-100 (98.32% and 83.20%, respectively) while achieving performance comparable to existing ViTs on medium-to-large datasets such as ImageNet.

In summary, this thesis addresses the low data efficiency of vision transformers from the perspective of attention distance. The proposed Swelling Focal Attention enables vision transformers to be trained effectively on datasets of any scale.
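The attention-distance analysis referred to above can be reproduced in outline as follows. This is a minimal sketch, not the thesis code: it computes, for each head, the average spatial distance between query and key patch positions weighted by the attention probabilities, which is the standard notion of mean attention distance for a ViT on a square patch grid. The tensor layout and the names `attn` and `grid_size` are assumptions made for illustration.

```python
import torch

def mean_attention_distance(attn: torch.Tensor, grid_size: int) -> torch.Tensor:
    """Mean attention distance per head, measured in patch units.

    attn: attention probabilities of shape (batch, heads, N, N),
          where N = grid_size * grid_size (class token removed).
    Returns a tensor of shape (heads,).
    """
    # 2-D coordinates of every patch on the grid.
    ys, xs = torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
    )
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)

    # Pairwise Euclidean distances between query and key patch positions.
    dist = torch.cdist(coords, coords)  # (N, N)

    # Attention-weighted distance, averaged over queries and the batch.
    weighted = (attn * dist).sum(dim=-1)  # (batch, heads, N)
    return weighted.mean(dim=(0, 2))      # (heads,)
```

Plotting this quantity per head and per layer is what reveals whether a model has retained heads with local-range attention.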
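The general idea of suppressing long-distance attention can be illustrated by restricting each head to a focal radius on the patch grid. The thesis's actual Multi-scale Focal Attention assigns attention ranges across heads in its own way; the per-head radius masking below is only an assumed example of the underlying mechanism, and the names `coords` and `radius_per_head` are hypothetical.

```python
import torch
import torch.nn.functional as F

def focal_masked_attention(q, k, v, coords, radius_per_head):
    """Scaled dot-product attention with a per-head focal radius (illustrative).

    q, k, v:          (batch, heads, N, dim_head)
    coords:           (N, 2) patch coordinates on the grid
    radius_per_head:  (heads,) maximum attention distance allowed per head
    """
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale  # (batch, heads, N, N)

    # Keys farther than a head's focal radius are masked out of its attention.
    dist = torch.cdist(coords, coords)  # (N, N)
    mask = dist[None, None] > radius_per_head[None, :, None, None]
    logits = logits.masked_fill(mask, float("-inf"))

    return F.softmax(logits, dim=-1) @ v
```

Giving different heads different radii yields a multi-scale mixture of short- and long-range attention, which is the spirit of the method described above.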