Fine-grained image classification is an important research direction in computer vision. It aims to accurately distinguish subcategories within the same basic category, such as black-footed and Laysan albatrosses within the bird category. The technology has wide-ranging civilian and military applications, including animal and plant conservation, disease diagnosis, industrial quality inspection, weapon recognition, and ship recognition. In recent years, fine-grained image classification methods have gradually shifted from supervised learning that relies on large amounts of additional annotation, such as bounding boxes and attribute information, to weakly supervised learning based solely on class labels, and have achieved better recognition performance in doing so. However, these methods usually use only a single source of feature representation to localise local regions, which makes the neural network difficult to optimise and leads to poor generalisation. Information fusion-based classification methods can integrate information from different sources, features, dimensions, and modalities, enhancing the model’s generalisation ability, and they play an important role in machine learning.

Therefore, to enhance the generalisation ability and applicability of fine-grained classification algorithms in complex scenarios, this thesis starts with the fusion of single-modal information, improving model performance and reducing optimisation difficulty through the fusion of multi-feature channels. On this basis, the training data is expanded from a single dataset to multiple datasets, and the label space is expanded from single-granularity classification to multi-granularity classification. Two fine-grained image classification algorithms are constructed, based on multi-dataset information fusion and multi-granularity information fusion respectively. They improve the discriminability of deep features across datasets and the feature representation ability in different granularity spaces, and they enhance the recognition accuracy and generalisation ability of the model. Subsequently, this thesis extends the fusion of single-modal information to the fusion of multi-modal information and constructs a fine-grained image classification algorithm based on multi-modal information fusion, which improves the knowledge transfer and metric ability of the expert system. The contributions of this thesis can be summarised as follows:

(1) We propose a multi-feature channel information fusion-based fine-grained image classification algorithm. Our algorithm enables the model to implicitly learn fine-grained features, which improves classification accuracy while reducing the difficulty of model optimisation. In terms of implementation, rather than treating the feature map as a whole, we differentiate the features of different channels and constrain their distribution through a loss function. The proposed mutual channel loss contains two channel-related components: a discriminative component and a diversity component. Through end-to-end training that uses only category labels, the mutual channel loss drives the model to mine highly discriminative local features for use in the inference stage. We validate the effectiveness of the proposed method on four widely used fine-grained image classification datasets and conduct detailed ablation experiments to verify the effectiveness of each component.

(2) We propose a multi-dataset information fusion-based fine-grained image classification algorithm, which promotes positive transfer between datasets while suppressing negative transfer. The proposed algorithm can be trained on multiple fine-grained image classification datasets and can efficiently and accurately recognise fine-grained labels in different coarse-grained feature spaces. Specifically, we introduce a feature decoupling module and a feature re-fusion module to eliminate negative transfer between different datasets and promote positive transfer. Additionally, since the ability to localise discriminative regions is dataset-agnostic, we propose a meta-learning-based dataset-agnostic spatial attention layer, which can take advantage of the additional training data that dataset mixing provides. Experimental results on 11 different mixed datasets constructed from four fine-grained datasets demonstrate that the proposed method can efficiently fuse information from multiple datasets and improve the model’s generalisation ability.

(3) We propose a multi-granularity information fusion-based fine-grained image classification algorithm, which efficiently decouples the promotion and inhibition relationships between coarse and fine granularity. Our experiments reveal that coarse-grained image classification suppresses the learning of fine-grained features, while fine-grained image classification promotes the learning of coarse-grained features. This finding inspires a simple and effective solution to the multi-granularity image classification problem: (i) separating coarse-grained features from fine-grained features by using multiple classification heads with different granularities, and (ii) allowing finer-grained features to participate in the decision-making of coarser-grained classifiers, thereby strengthening the coarse-grained classifiers, while preventing the coarser-grained classifiers from optimising the finer-grained features. Experimental results demonstrate that the proposed method can effectively fuse information from different granularities and achieves excellent performance in multi-granularity classification tasks.

(4) We propose a multi-modal information fusion-based fine-grained image classification algorithm and develop a transferable knowledge extraction framework and evaluation method. Our approach uses a multi-stage learning framework that models the knowledge held by experts and by novices separately, and then extracts the expert-exclusive knowledge through knowledge distillation so that it can be transferred. Additionally, we simulate a process in which a comprehensive encyclopedia assists individuals in object recognition, in order to evaluate the practical impact of different types of transferable knowledge. We also propose a quantitative metric called Transferable Effective Model Attention (TEMI), which evaluates algorithm performance directly from model output. Through a human study, we validate that the proposed approach fuses multi-modal information efficiently and continuously improves fine-grained object recognition across different populations. Furthermore, we confirm the effectiveness of the proposed measurement method, TEMI, which enhances the practicality and generalisability of our approach.
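
The gradient-blocking rule in contribution (3) can be illustrated with a minimal sketch. This is a scalar toy, not the thesis implementation: the `Value` class, the function `coarse_loss_grads`, the weights, the input, and the use of squared error in place of a classification loss are all assumptions made for illustration. It shows only that a finer-grained feature can contribute to a coarser-grained prediction without receiving gradient from the coarse loss.

```python
class Value:
    """Tiny scalar reverse-mode autodiff node (hypothetical helper, illustration only)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def detach(self):
        # Same value, no parents: gradients cannot flow back through this node.
        return Value(self.data)

    def backward(self):
        # Post-order DFS gives parents before children; reverse it to propagate.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


def coarse_loss_grads(stop_gradient):
    """Return (d loss/d w_fine, d loss/d w_coarse) for the coarse loss only."""
    x = Value(2.0)            # toy input
    w_fine = Value(0.5)       # weight producing the fine-grained feature
    w_coarse = Value(-1.0)    # weight producing the coarse-grained feature

    fine_feat = w_fine * x
    coarse_feat = w_coarse * x

    # The coarse head sees the fine feature, optionally with gradients blocked.
    shared = fine_feat.detach() if stop_gradient else fine_feat
    coarse_score = coarse_feat + shared

    # Squared error against a target of 1.0 stands in for a classification loss.
    err = coarse_score + Value(-1.0)
    loss = err * err
    loss.backward()
    return w_fine.grad, w_coarse.grad


if __name__ == "__main__":
    print(coarse_loss_grads(stop_gradient=True))
    print(coarse_loss_grads(stop_gradient=False))
```

With `stop_gradient=True` the coarse loss leaves the fine-grained weight untouched while still updating the coarse-grained weight; with `stop_gradient=False` both weights receive gradient. In a deep learning framework the same effect would normally come from the framework's own stop-gradient operation rather than a hand-written class.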