Fine-grained image classification (FGIC) is an important research direction in computer vision and pattern recognition. It aims to recognize images of different sub-categories at a fine-grained level within a given traditional semantic category, and it has important scientific significance and application value in many scenarios such as ecological and environmental protection, intelligent transportation, and intelligent security. FGIC is characterized by small inter-class differences and large intra-class differences among sub-categories, so the classification model must be able to capture subtle local regions and learn the most discriminative features. Existing algorithms suffer from insufficient localization of local regions and under-utilization of multi-scale local features. To address these problems, this thesis designs fine-grained image classification networks based on deep learning. The main contents of this thesis are summarized as follows:

(1) To address the insufficient localization of local regions and the under-utilization of multi-scale local features, a fine-grained image classification network combining a multi-scale attention mechanism and knowledge distillation is proposed. First, the scale differences of the output features at different stages of the backbone network are exploited to extract multi-scale features. Then, a dual attention mechanism module is applied to screen and refine the features at each scale, enhancing the representation of locally distinctive features at different scales. To make full use of the complementarity of the multi-scale local features, a feature fusion module is introduced to map the screened multi-scale features into the same feature space and fuse them. Finally, a knowledge distillation module based on global output responses is proposed to improve the representation of diverse fine-grained features: the global output values of a trained teacher network are used to supervise the student network so that it better learns regional fine-grained features on the multi-scale branches. Extensive experimental results verify that the proposed network is highly competitive with existing methods on fine-grained image benchmark datasets.
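As an illustration of contribution (1), the following is a minimal sketch only, assuming a PyTorch/torchvision ResNet-50 backbone: a CBAM-style dual (channel + spatial) attention block screens the stage-3/4/5 features, a 1x1 projection maps them into a common space for fusion, and a response-based distillation loss lets a trained teacher's global outputs supervise the student. Module names, dimensions, and hyper-parameters are illustrative assumptions, not the thesis implementation.

# Sketch: multi-scale feature screening (dual attention), fusion, and
# response-based knowledge distillation. All names/sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class DualAttention(nn.Module):
    """Channel attention followed by spatial attention (CBAM-like)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # channel attention from globally pooled descriptors
        w = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * w
        # spatial attention from channel-wise mean and max maps
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))

class MultiScaleStudent(nn.Module):
    """ResNet-50 student that screens and fuses features from three stages."""
    def __init__(self, num_classes, fused_dim=512):
        super().__init__()
        backbone = resnet50()  # randomly initialised; load weights as needed
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1)
        self.stages = nn.ModuleList([backbone.layer2, backbone.layer3,
                                     backbone.layer4])
        self.attn = nn.ModuleList([DualAttention(c) for c in (512, 1024, 2048)])
        # map screened multi-scale features into a common space before fusion
        self.proj = nn.ModuleList([nn.Conv2d(c, fused_dim, 1)
                                   for c in (512, 1024, 2048)])
        self.fc = nn.Linear(3 * fused_dim, num_classes)

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage, attn, proj in zip(self.stages, self.attn, self.proj):
            x = stage(x)
            feats.append(proj(attn(x)).mean(dim=(2, 3)))  # screen + pool
        return self.fc(torch.cat(feats, dim=1))           # fused prediction

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Cross-entropy on labels plus KL between softened teacher/student outputs."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    return alpha * kd + (1 - alpha) * F.cross_entropy(student_logits, labels)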
(2) To address the problem that the Vision Transformer focuses only on global information and cannot generate multi-scale fine-grained classification features, a feature-token selection and fusion Transformer method is proposed. To capture local information and the complementary information between layers for fine-grained images, a feature-token selection module with adaptive thresholding is first constructed. It mines the required discriminative patch tokens using the self-attention weight maps generated during the network's own training, without introducing additional parameters, thereby removing background interference and improving the network's ability to capture discriminative local features. Then, a feature fusion module is introduced to aggregate the class tokens from different layers and all the discriminative patch tokens into fused features, which are fed into the last layer of the Transformer encoder, so that the class token output by the last encoder layer learns to aggregate the global and local features of fine-grained images. Finally, a classification integration module based on multi-layer feature information is introduced, which integrates the class tokens from the last three layers of the Transformer encoder and sends them to the classifier for prediction, so that the classification network exploits the complementary information among different layers to enhance the final classification performance of the model. Extensive experimental results verify that the proposed network is highly competitive with existing methods on fine-grained image benchmark datasets.
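As an illustration of contribution (2), the following is a minimal sketch only, assuming a standard ViT encoder whose class token sits at index 0: patch tokens are selected with an adaptive threshold derived from the class-token self-attention weights (introducing no extra parameters), and the class tokens of the last three layers are integrated for prediction. The tensor shapes, the per-image mean threshold, and the averaged per-layer heads are assumptions for illustration, not the thesis implementation.

# Sketch: adaptive-threshold patch-token selection and multi-layer
# class-token integration for a ViT-style encoder. Shapes are assumptions.
import torch
import torch.nn as nn

def select_patch_tokens(hidden_states, attn_weights):
    """hidden_states: (B, 1+N, D) tokens of one encoder layer.
    attn_weights:  (B, H, 1+N, 1+N) self-attention of the same layer.
    Returns a list of (K_b, D) tensors of selected patch tokens per image."""
    # attention the class token (query 0) pays to each patch token,
    # averaged over heads; reuses weights the network already computes
    cls_attn = attn_weights[:, :, 0, 1:].mean(dim=1)   # (B, N)
    threshold = cls_attn.mean(dim=1, keepdim=True)     # adaptive, per image
    selected = []
    for b in range(hidden_states.size(0)):
        keep = cls_attn[b] >= threshold[b]             # mask out background tokens
        selected.append(hidden_states[b, 1:][keep])
    return selected

class MultiLayerClassifier(nn.Module):
    """Integrates the class tokens of the last three encoder layers."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(dim, num_classes) for _ in range(3)])

    def forward(self, cls_tokens):                     # list of three (B, D) tensors
        # average the per-layer predictions as the integrated output
        return torch.stack([h(t) for h, t in zip(self.heads, cls_tokens)]).mean(0)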