| Fine-grained image classification primarily aims to distinguish multiple subclasses of objects in the same broad category,such as dog breeds and clothing styles.It has a wide range of application scenarios in areas such as biodiversity detection,climate change assessment,smart retail,and transportation.Compared with coarse-grained image classification tasks,fine-grained images have more stringent and demanding classification requirements due to their characteristics of significant intra-class differences and small inter-class differences,which make it challenging to discover the most critical features in images.Earlier strongly-supervised fine-grained image classification methods relied on manually labeled bounding boxes and critical points to limit their usefulness.In contrast,weakly supervised fine-grained image classification methods require only image-level labeling,which is more scalable and practical.The existing weakly supervised fine-grained image classification methods use special localization,and multiple image zooms to improve the model’s classification performance.However,such an operation requires many computational resources,and the processing is complicated.Therefore,after an in-depth study,this paper proposes a weakly supervised localization fine-grained image classification model based on multi-branch attention and a fine-grained image classification model fusing multi-level features.The main research of this paper contains the following two points.(1)The existing weakly supervised fine-grained image classification models based on weak supervision have problems such as complicated operation and low accuracy.To address these problems,this paper proposes a weakly supervised localization fine-grained image classification algorithm based on multi-branch attention,using a two-branch structure to reduce the number of zooms while improving the classification accuracy of the model.First,we introduce a localization enhancement module to suppress the significant features of the feature map output from the convolutional neural network to improve the accuracy of weakly supervised localization.Second,the attention mechanism module is used to assist the convolutional neural network to better identify the classification regions in fine-grained images.Third,another KNN attention-based classification module is proposed to improve the classification accuracy,which inputs the localized target object images into the KNN attention-based Swin Transformer network to obtain the final classification results.Experimental results on three publicly available datasets,CUB-200-2011,Stanford Cars,and Stanford Dogs,show that the proposed weakly supervised localization fine-grained image classification algorithm based on multi-branch attention can improve the classification accuracy compared with more advanced existing methods.(2)Although the weakly supervised localization fine-grained image classification algorithm based on multi-branch attention solves the problems of existing weakly supervised methods such as large computational resource consumption due to special localization and multiple zooming image operations.However,this multi-branch structure is more complicated to operate.Therefore,to simplify the network,this paper proposes an end-to-end training model that fuses multi-level features to improve the accuracy of fine-grained image classification by fusing the advantages of Transformer and convolutional neural networks at multi-scale levels.It is improved in the following two aspects: first,to better acquire local and global information in fine-grained images to improve the classification accuracy;Transformer network and convolutional neural network are used to obtain global and local information,respectively.Second,the feature fusion module is proposed to adaptively fuse global and local features under different layers by contextual attention and channel attention mechanisms.Comparative experimental results on three mainstream fine-grained image datasets show that the algorithm can achieve better classification accuracy than most existing algorithms in fine-grained image classification tasks and reduce the algorithm’s complexity. |