
Cross-modal Feature Augmentation For Visual Semantic Understanding

Posted on: 2022-11-05    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X Guan    Full Text: PDF
GTID: 1488306764960109    Subject: Automation Technology
Abstract/Summary:
Fine-grained Visual Classification (FGVC) is a longstanding and fundamental problem in computer vision that underpins a diverse set of real-world applications. The task of FGVC is to classify visual objects into subordinate categories, e.g., species of birds or models of cars. The small inter-class variation and large intra-class variation inherent to FGVC make it a challenging problem, and the high cost of building fine-grained datasets further restricts the deployment of FGVC models in practical scenarios. To address these three core problems, this dissertation investigates fine-grained image classification from three perspectives: cross-modal methods, feature augmentation, and interpretability. The main contributions are as follows:

(1) To tackle insufficient training data in practical application scenarios, a cross-modal feature augmentation method is proposed (sketched below). Using cross-modal semantic information, it identifies discriminative regions in image features and decision boundaries for feature classification, and it generates synthetic features from fine-grained image features to supplement the training data of FGVC models. The method significantly improves FGVC accuracy in multiple low-data training scenarios, providing a solution for applications where models cannot be deployed because sufficient training data is unavailable.

(2) To further combat the overfitting caused by insufficient data, a numerical-method-based high-order feature augmentation method is proposed (sketched below). It models the manifold distribution of image features extracted by a pre-trained residual network and generates new features by sampling from that distribution. The method significantly improves the classification performance of a variety of FGVC models when training data is severely limited.

(3) To meet the FGVC task's need to extract global and local detail information simultaneously, a cross-modal joint training method and a cumulative channel reconstruction mechanism are proposed (sketched below). Expanding the optimization objective of the classification model from a single subcategory label to cross-modal semantic information improves the model's ability to extract key discriminative features. Meanwhile, drawing on the cumulative-sum method from sequence learning, local and global features are reconstructed in an unsupervised manner during forward propagation, augmenting the early-layer features of deep networks that carry local detail.

(4) To address the large intra-class variance and inter-class similarity of FGVC tasks, a feature fusion method based on complex linear transformation is proposed (sketched below). By expanding image features from a real vector space to complex vectors and applying complex linear transformations, the method exploits the many-to-one operational properties of complex values to better model images that differ greatly within the same subcategory. Furthermore, to counter the large inter-class similarity of FGVC datasets, a Graph Convolutional Network introduces cross-modal information into the feature classifier and establishes topological relationships between subcategories, further strengthening the model's ability to distinguish subcategories with similar appearance.
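As a rough illustration of contribution (1), the sketch below conditions a small generator on class-level semantic embeddings (e.g., text attributes) to synthesize image-level features for data-scarce subcategories, which would then be mixed with real features to train the classifier. All module names, dimensions, and the generator architecture are illustrative assumptions, not the dissertation's actual design.

```python
# Hedged sketch of cross-modal feature augmentation (contribution 1).
# A generator conditioned on class semantic embeddings plus noise produces
# synthetic image features to supplement scarce real features; dimensions
# and architecture are assumptions, not the dissertation's actual model.
import torch
import torch.nn as nn

class SemanticFeatureGenerator(nn.Module):
    def __init__(self, sem_dim=300, noise_dim=64, feat_dim=2048):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, feat_dim),
            nn.ReLU(),  # keep outputs in the non-negative range of CNN features
        )

    def forward(self, class_emb, n_per_class=4):
        # class_emb: (C, sem_dim), one semantic vector per subcategory.
        emb = class_emb.repeat_interleave(n_per_class, dim=0)
        noise = torch.randn(emb.size(0), self.noise_dim, device=emb.device)
        # Returns (C * n_per_class, feat_dim) synthetic features.
        return self.net(torch.cat([emb, noise], dim=1))
```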
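For contribution (2), the abstract does not spell out the high-order numerical scheme, so the sketch below substitutes a common stand-in for sampling near the feature manifold: fitting a per-class Gaussian to pooled features from a frozen pre-trained ResNet and drawing new features from it. The Gaussian model and the shrinkage constant are assumptions, not the dissertation's method.

```python
# Hedged stand-in for manifold-based feature augmentation (contribution 2).
# Per-class Gaussians fitted to pooled ResNet features approximate the local
# feature manifold; sampling them yields synthetic training features.
import torch

def sample_synthetic_features(features, labels, n_new=10):
    # features: (N, D) pooled features from a frozen pre-trained ResNet
    # labels:   (N,)  integer subcategory ids (>= 2 samples per class assumed)
    new_feats, new_labels = [], []
    for c in labels.unique():
        f = features[labels == c]                        # (Nc, D)
        mu = f.mean(dim=0)
        # Shrinkage keeps the covariance positive definite when Nc < D.
        cov = torch.cov(f.T) + 1e-3 * torch.eye(f.size(1))
        dist = torch.distributions.MultivariateNormal(mu, cov)
        new_feats.append(dist.sample((n_new,)))          # (n_new, D)
        new_labels.append(torch.full((n_new,), int(c)))
    return torch.cat(new_feats), torch.cat(new_labels)
```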
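Contribution (3)'s cumulative channel reconstruction can be pictured as below: a cumulative sum over the channel axis aggregates information across channels, and a small decoder reconstructs an early, detail-rich feature map from a deep one under an unsupervised loss. The 1x1-conv decoder, the normalization, and the MSE loss are assumptions made for illustration.

```python
# Hedged sketch of cumulative channel reconstruction (contribution 3).
# A running channel-wise sum aggregates earlier channels into later ones,
# and a decoder reconstructs early-layer (local-detail) features from it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CumulativeReconstruction(nn.Module):
    def __init__(self, deep_ch=2048, early_ch=256):
        super().__init__()
        self.decode = nn.Conv2d(deep_ch, early_ch, kernel_size=1)

    def forward(self, deep_feat, early_feat):
        # Cumulative sum over channels: channel k aggregates channels 0..k;
        # dividing by the channel count keeps the magnitude comparable.
        cum = torch.cumsum(deep_feat, dim=1) / deep_feat.size(1)
        rec = self.decode(cum)
        rec = F.interpolate(rec, size=early_feat.shape[-2:],
                            mode="bilinear", align_corners=False)
        # Unsupervised reconstruction loss against the early feature map.
        return F.mse_loss(rec, early_feat.detach())
```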
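One plausible reading of the complex linear transformation in contribution (4) is sketched below: two real feature vectors become the real and imaginary parts of a complex vector, a complex-valued linear layer transforms it, and the modulus (a many-to-one map from complex to real values) returns a real feature. The pairing of inputs and the modulus read-out are assumptions, not the dissertation's confirmed design.

```python
# Hedged sketch of complex-valued feature fusion (contribution 4, part one).
# Real features are lifted into complex space, linearly transformed there,
# and mapped back to real space via the many-to-one modulus operation.
import torch
import torch.nn as nn

class ComplexFusion(nn.Module):
    def __init__(self, dim=2048, out_dim=512):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(dim, out_dim, dtype=torch.cfloat) * 0.01)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, dim) real features fused as one complex vector.
        z = torch.complex(feat_a, feat_b)
        out = z @ self.weight       # complex linear transformation
        return out.abs()            # modulus: many-to-one map back to reals
```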
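Finally, the GCN-based classifier in contribution (4) follows a well-known pattern (cf. ML-GCN): subcategory word embeddings are propagated over a label-relation graph to produce per-class classifier weights, so semantically related subcategories share classifier structure. The two-layer design, the embedding source, and the adjacency construction are assumptions.

```python
# Hedged sketch of a GCN classifier head over subcategory relations
# (contribution 4, part two). Class embeddings propagated over a label
# graph become the weights of the final image-feature classifier.
import torch
import torch.nn as nn

class GCNClassifier(nn.Module):
    def __init__(self, sem_dim=300, feat_dim=2048):
        super().__init__()
        self.w1 = nn.Linear(sem_dim, 1024)
        self.w2 = nn.Linear(1024, feat_dim)

    def forward(self, img_feat, class_emb, adj):
        # img_feat:  (B, feat_dim) image features from the backbone
        # class_emb: (n_cls, sem_dim) subcategory semantic embeddings
        # adj:       (n_cls, n_cls) normalized subcategory adjacency
        h = torch.relu(adj @ self.w1(class_emb))  # first propagation step
        w = adj @ self.w2(h)                      # (n_cls, feat_dim) weights
        return img_feat @ w.T                     # logits: (B, n_cls)
```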
Keywords/Search Tags: Feature augmentation, cross-modal learning, interpretability, fine-grained visual classification