Zero-shot Image Classification Based On Cross-modal Semantic Alignment

Posted on: 2020-09-06
Degree: Master
Type: Thesis
Country: China
Candidate: X J Yu
Full Text: PDF
GTID: 2518306518465024
Subject: Information and Communication Engineering
Abstract/Summary:
With the development of large-scale image datasets and deep learning approaches, image classification has made remarkable progress. Because deep learning is data-driven and supervised, a large amount of well-labeled data is required during the training phase. However, there are tens of thousands of categories in the real world, and manual annotation is extremely time-consuming and expensive. Moreover, for some scarce categories, it is often difficult to collect sufficient samples for training. Inspired by humans' ability to recognize objects from new categories given only some semantic descriptions, Zero-Shot Learning (ZSL) has recently gained considerable popularity. ZSL aims to handle unseen categories that are absent from the training phase by using auxiliary semantic information, which is a more open and dynamic machine learning setup. It is typically addressed by resorting to an embedding space that constructs effective cross-modal interactions between the visual and semantic modalities for semantic alignment, thereby enabling knowledge transfer from the seen classes to the unseen ones. In this work, based on different cross-modal semantic alignment spaces, we propose two effective models to solve the ZSL task from the perspectives of manifold learning and class imbalance learning, respectively.

Firstly, under the assumption that the distribution of the class semantic information has an intrinsic manifold structure, we investigate two manifold-embedding-based approaches, ME-ZSL and MCCA-ZSL. Specifically, ME-ZSL imposes manifold constraints in the semantic embedding space from three aspects, i.e., the intra-class compactness, the inter-class separability, and the locality-preserving property in the visual space supervised by the category information. MCCA-ZSL, in contrast, captures the manifold structure of the visual and semantic modalities in a common embedding space, constraining the inter-class relative distribution with the class semantic relevance and the intra-class relationships of samples with their similarities. MCCA-ZSL is ultimately equivalent to a singular value decomposition problem. Both methods explicitly formulate the objective function with manifold embedding and have closed-form solutions with relatively high efficiency and good interpretability. Extensive experiments on three popular datasets, AwA, CUB and NAB, validate the effectiveness of both methods.

Secondly, taking the visual space as the embedding space, we focus on the class imbalance issue in ZSL and put forward a Semantics-Guided Class Imbalance Learning Model (SCILM). At the class level, we design a class-balanced training process, instead of the traditional batch-based training fashion, to balance the contribution of sample-scarce categories. Specifically, we randomly select the same number of images from each class across all training classes to form a batch during each iteration and align the different modalities at the class level. At the instance level, we attend to the differing representation ability of individual samples and synthesize well-represented class visual prototypes guided by the semantic relevances. Extensive experiments on three imbalanced ZSL benchmark datasets demonstrate that SCILM improves the knowledge transfer ability on the sample-scarce categories with a relatively simple network structure. SCILM achieves good performance under both Traditional Zero-shot Learning (TZSL) and Generalized Zero-shot Learning (GZSL) tasks, which provides some potential solutions for class-imbalanced multi-modal classification tasks.
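
The class-balanced batch construction described in the abstract (the same number of images drawn from every training class in each iteration) can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `class_balanced_batch`, the parameter `per_class`, and the toy label list are hypothetical, and drawing with replacement when a class has fewer than `per_class` images is an assumption made here so that sample-scarce classes still fill their quota.

```python
import random
from collections import defaultdict


def class_balanced_batch(labels, per_class, rng=None):
    """Return sample indices containing `per_class` draws from every class.

    `labels[i]` is the class label of training sample i.
    """
    rng = rng or random.Random()

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    batch = []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) >= per_class:
            # Enough images: draw without replacement.
            batch.extend(rng.sample(pool, per_class))
        else:
            # Assumption: sample-scarce classes are drawn with replacement.
            batch.extend(rng.choices(pool, k=per_class))
    return batch


# Toy imbalanced label list: class 0 has 6 images, class 1 has 2, class 2 has 1.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2]
batch = class_balanced_batch(labels, per_class=2, rng=random.Random(0))
```

Every class contributes exactly `per_class` indices to `batch`, so a class with a single image carries the same weight in the batch as a class with many, which is the balancing effect the abstract attributes to the class-level training process.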
Keywords/Search Tags: Zero-shot Learning, Image Classification, Cross-modal Alignment, Manifold Learning, Class Imbalance Learning, Multimodal Learning