Fine-grained Image Recognition Based On Deformable Transformer And Multi-Scale Attention

Posted on: 2024-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Ma
Full Text: PDF
GTID: 2568307142466264
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, the rapid development of deep learning and computing technology has driven progress in image classification, which has been widely applied in many fields and has attracted extensive attention from both industry and academia. Fine-grained image classification is an important branch of image classification. Compared with coarse-grained image recognition, it is more challenging and calls for more in-depth research. General image classification distinguishes between different categories, whose images differ significantly from one another. Fine-grained image classification, by contrast, must distinguish between sub-categories within the same category. Because the differences between sub-categories are small while the differences within a single sub-category can be large, the key to fine-grained image classification is to accurately locate the discriminative regions and to extract features from those regions as the basis for classification. In this thesis, we study fine-grained image classification algorithms based on deep learning. The effectiveness and practicality of the proposed methods are validated through experiments on relevant datasets, together with a comprehensive comparative analysis. The main work of this thesis is as follows:

(1) Compared with Convolutional Neural Networks (CNNs), the Vision Transformer (ViT) is better at capturing global information, which can compensate for the reliance of CNNs on local information. Although a CNN can capture local information, its fixed structure and limited ability to model geometric information restrict its performance. To address this issue, this thesis proposes a Deformable Transformer structure that combines deformable convolutions and Swin Transformer layers. Deformable convolutions adaptively adjust the shape of their kernels to fit features of different scales and shapes, which reduces interference from background information and helps capture discriminative regions. The Transformer layers capture long-range dependencies in the image, and combining the two improves recognition accuracy. The method also fuses the feature maps from each stage, enriching the representation by combining low-level and high-level features. The model is trained with cross-entropy loss and contrastive loss and is evaluated on three commonly used fine-grained image datasets. The results show that it improves fine-grained image recognition accuracy compared with other methods.

(2) The self-attention mechanism in ViT requires converting a two-dimensional image into a one-dimensional input sequence, which disrupts the two-dimensional structure of the image. In addition, its inherent quadratic computational complexity makes it costly for high-resolution images. To address this issue, this thesis proposes the DD3-MSANet network, whose core is a Multi-Scale Attention (MSA) module. The MSA module consists of depthwise convolution, dilated convolution, and 3D convolution components. First, depthwise convolution extracts local information while reducing the number of parameters. Second, dilated convolution enlarges the receptive field of the convolution kernel while keeping the number of parameters unchanged, so that each convolution covers a wider range and captures contextual information. Then, the multi-scale features obtained with different dilation rates are combined through 3D convolution to enrich the feature information. Finally, channel attention highlights the regions of interest. The method was tested on three commonly used fine-grained image datasets, and the results show that DD3-MSANet effectively improves classification accuracy while reducing the number of parameters and the computational complexity. The method is of practical value for tasks that require processing high-resolution images.
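
The parameter and receptive-field claims made for the convolution components in part (2) can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the thesis: the function names, the 64-channel and 3x3 sizes, and the 1D toy signal are all assumptions chosen for the example. It uses only the Python standard library, counts weights without bias terms, and counts the depthwise filters only (no pointwise layer).

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k 2D convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_conv_params(channels, k):
    """Weights in a depthwise k x k convolution: one k x k filter per channel."""
    return channels * k * k


def effective_kernel_size(k, dilation):
    """Span covered by a k-tap kernel whose taps are `dilation` positions apart."""
    return k + (k - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1D dilated cross-correlation, written out explicitly."""
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        # Each tap reads the input `dilation` steps after the previous tap.
        out.append(sum(w * signal[start + i * dilation]
                       for i, w in enumerate(kernel)))
    return out


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    print(standard_conv_params(64, 64, 3))   # 36864
    print(depthwise_conv_params(64, 3))      # 576

    # Same 3 weights, wider coverage as the dilation rate grows.
    for d in (1, 2, 3):
        print(d, effective_kernel_size(3, d))  # 1 3, then 2 5, then 3 7

    signal = [1, 2, 3, 4, 5, 6, 7]
    print(dilated_conv1d(signal, [1, 1, 1], dilation=1))  # [6, 9, 12, 15, 18]
    print(dilated_conv1d(signal, [1, 1, 1], dilation=2))  # [9, 12, 15]
```

With a dilation rate of 2, the same three weights cover five input positions instead of three, which is the sense in which the receptive field grows while the parameter count stays fixed.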
Keywords/Search Tags: Fine-Grained Image Recognition, Convolutional Network, Transformer, Contrastive Loss Function, Multi-Scale Attention