
Research On Fine-Grained Visual Classification Based On Compact Vision Transformer

Posted on: 2024-04-22  Degree: Master  Type: Thesis
Country: China  Candidate: H Xu  Full Text: PDF
GTID: 2558307181454254  Subject: Electronic Information (in the field of computer technology) (professional degree)
Abstract/Summary:
Fine-Grained Visual Classification (FGVC) is the task of distinguishing between different subclasses within a given object category, where inter-class differences are small and intra-class differences are large. The core problem lies in accurately locating the discriminative regions within the classification network so as to enhance its ability to capture subtle differences, which makes FGVC more challenging than traditional classification. In recent years, the Vision Transformer (ViT) has demonstrated strong performance in traditional computer vision tasks such as classification by leveraging the self-attention mechanism to obtain global attention information. However, it still lacks the ability to distinguish the subtle differences required by FGVC tasks. Moreover, ViT demands large amounts of data and has high computational complexity, requiring advanced computing hardware; it is often difficult to train and can therefore perform unstably. In this thesis, we conduct in-depth research on the adaptation and performance of ViT in fine-grained visual classification and propose a compact ViT structure to address its large data requirements and high computational complexity. We apply this structure to FGVC tasks and achieve leading performance. The main research content of this thesis is as follows:

(1) Research on fine-grained visual classification based on ViT. We analyze the problems of traditional fine-grained visual classification methods based on convolutional neural networks and investigate the adaptability of the ViT structure to fine-grained visual classification. Building on the Vision Transformer structure, we improve and apply a part selection module, which locates the most discriminative regions among the embedding vectors, removes redundant information, and feeds the selected tokens into the last encoding layer to enhance the model's ability to capture subtle differences. We also add a mixed loss function combining a contrastive loss with the cross-entropy loss: it uses patch representations from the early encoding layers and regularizes the patches in the deeper layers, reducing the similarity of patch representations and further improving model performance. Experimental results show that the proposed modules effectively improve the classification performance of the ViT structure on FGVC tasks.
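A minimal PyTorch sketch of the part selection idea described above. The abstract does not give the exact selection rule or layer placement, so this variant is an assumption: it rolls attention maps across the earlier encoder layers and keeps the top-k patch tokens most attended by the class token before the final layer.

```python
import torch

def select_parts(tokens, attn_maps, num_parts):
    """Select the most discriminative patch tokens from attention maps
    (attention-rollout style; top-k selection is an illustrative choice).

    tokens:    (B, N+1, D) hidden states before the last encoder layer,
               where index 0 is the class token
    attn_maps: list of (B, H, N+1, N+1) attention matrices from the
               preceding encoder layers
    """
    # Multiply attention maps across layers to estimate how much each
    # patch ultimately contributes to the class token.
    rollout = attn_maps[0]
    for a in attn_maps[1:]:
        rollout = torch.matmul(a, rollout)
    # Attention paid by the class token (row 0) to each patch,
    # averaged over heads: (B, N)
    cls_attn = rollout[:, :, 0, 1:].mean(dim=1)
    # Keep only the top-k most attended patches; the rest are treated
    # as redundant information and dropped.
    idx = cls_attn.topk(num_parts, dim=-1).indices            # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))   # (B, k, D)
    parts = torch.gather(tokens[:, 1:], 1, idx)
    # The class token plus the selected part tokens are what the last
    # encoding layer then attends over.
    return torch.cat([tokens[:, :1], parts], dim=1)
```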
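The mixed loss could likewise be sketched as cross-entropy plus a supervised contrastive term. Note this is a simplification: the thesis additionally regularizes patch representations across early and deep layers, which is omitted here, and the hyperparameters `alpha` and `margin` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mixed_loss(logits, cls_tokens, labels, alpha=1.0, margin=0.4):
    """Cross-entropy plus a contrastive term on class-token embeddings:
    same-class pairs are pulled together, different-class pairs pushed
    apart once their cosine similarity exceeds the margin."""
    ce = F.cross_entropy(logits, labels)
    z = F.normalize(cls_tokens, dim=-1)            # (B, D) unit vectors
    sim = z @ z.t()                                # (B, B) cosine sims
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=sim.device)
    pos = (1.0 - sim)[same & ~eye]                 # same-class pairs
    neg = F.relu(sim - margin)[~same]              # different-class pairs
    # Guard against batches with no positive (or negative) pairs.
    contrastive = (pos.mean() if pos.numel() else sim.new_zeros(())) \
                + (neg.mean() if neg.numel() else sim.new_zeros(()))
    return ce + alpha * contrastive
```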
(2) Research on fine-grained visual classification based on a compact ViT. To address the large data requirements and high computational complexity of the ViT structure in fine-grained visual classification, we use multiple convolutional blocks to generate the model inputs, retaining more low-level information and convolutional inductive bias; compared with direct patch segmentation, the model obtains more information and can extract low-dimensional features efficiently. We also use a sequence pooling module, which removes the class token and lets the encoder focus solely on self-attention over the patches, further improving performance while significantly reducing model complexity. This eliminates the dependency on large-scale data and allows the model to converge quickly. On this basis, we propose a fine-grained visual classification network based on the compact Vision Transformer: the convolutional input blocks reduce the dependency on data volume, sequence pooling removes the classification token to reduce computational complexity, and the part selection module and mixed loss function further improve fine-grained classification performance.

Experimental results demonstrate that the proposed model has significant advantages in computational complexity and data requirements. With less data and limited computational resources, it achieves superior accuracy on the common datasets CUB-200-2011, Butterfly-200, Stanford Dogs, Stanford Cars, and NABirds (88.9%, 87.4%, 89.0%, 93.4%, and 88.0%, respectively). On average, training time decreased by 73.8% compared with ViT-B_16 and by 93.9% compared with TransFG, while the parameter count was only roughly one quarter of either model. The experiments show that the proposed model outperforms other popular methods in terms of data requirements and computational complexity.
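A minimal sketch of the two compact-ViT components described in (2), following the general Compact Convolutional Transformer recipe; the channel widths, block counts, and kernel sizes below are illustrative assumptions, not the thesis's exact configuration.

```python
import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    """Replace ViT's direct patch slicing with stacked convolutional
    blocks, retaining low-level detail and convolutional inductive bias."""
    def __init__(self, in_ch=3, dim=256):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv2d(in_ch, dim // 2, 3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(dim // 2, dim, 3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),
        )

    def forward(self, x):                     # (B, 3, H, W)
        x = self.blocks(x)                    # (B, dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)   # (B, N, dim) token sequence

class SeqPool(nn.Module):
    """Sequence pooling: a learned attention weighting over the patch
    tokens replaces the class token entirely."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, 1)

    def forward(self, tokens):                # (B, N, dim)
        w = self.attn(tokens).softmax(dim=1)  # (B, N, 1) token weights
        return (w.transpose(1, 2) @ tokens).squeeze(1)  # (B, dim)

# Usage (hypothetical pipeline): tokenize, run a transformer encoder,
# pool, then classify.
#   tok  = ConvTokenizer(dim=256)(images)
#   feat = SeqPool(256)(encoder(tok))
#   out  = classifier(feat)
```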
Keywords/Search Tags: compact, fine-grained visual classification, Vision Transformer, inductive bias