
Cross-modal Retrieval Based On Deep Learning

Posted on: 2018-07-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Shao
Full Text: PDF
GTID: 1318330518495981
Subject: Information and Communication Engineering
Abstract/Summary:
Over the last decade, multimedia content on the web has grown explosively. Cross-modal retrieval has become a hot topic in multimedia information retrieval because it fuses multiple modalities, such as image and text. However, many challenging problems remain, such as how to represent multiple modalities and how to model the correlation between them. The study of cross-modal retrieval is therefore significant for both real-world applications and academic research.

To address the central difficulty of cross-modal retrieval, i.e., effectively building the correlation between an image and its corresponding text, we start from canonical correlation analysis (CCA) and leverage deep learning to conduct in-depth research on cross-modal retrieval. The contributions of this thesis are as follows:

1. Introducing semantic consistency into correlation learning between image and text. To improve the semantic consistency of the latent space, we extend CCA from 2-view to 3-view and maximize the correlation among image, text and semantics. We propose four kinds of semantics: supervised category labels, unsupervised hypergraph semantics, locality neighborhood and locality preserving. For hypergraph semantics, we propose a fast sampling method for generating hyperedges. Experiments show that all four kinds of semantics are effective.

2. Two methods for overcoming over-fitting: (1) Autoencoder. A reconstruction layer is added after the correlation learning layer to reconstruct the intra-modal input. The back-propagated reconstruction loss can be viewed as a regularization term and hence mitigates over-fitting. (2) Progressive framework. A linear projection layer is added to the traditional framework, and the training of the linear projection is combined with the training of the nonlinear layers to learn better representations. To validate the general applicability of the progressive framework, we apply it to three different applications.
Experiments show that the progressive framework can provide a better and faster solution for a wider range of problems optimized by neural networks.

3. Two methods for optimizing the similarity metric: (1) Search-based similarity. In the latent space, traditional CCA-based methods score the relevance of image-text pairs directly with a given distance metric. Inspired by PageRank, we propose a search-based similarity measure that scores relevance indirectly. (2) Metric learning. We transfer metric learning methods from person re-identification to cross-modal retrieval: we use large scale similarity learning (LSSL) as the distance measure and propose constructing similar pairs based on the semantic consistency of cross-modal retrieval. Experiments show that search-based similarity and LSSL are both effective and are complementary in improving performance.

4. Two models competitive with state-of-the-art methods, built on the above studies of semantic consistency, over-fitting and metric optimization. For unsupervised cross-modal retrieval, we propose an improved CCA (ICCA). Traditional CCA fails to capture intra-modal semantic consistency and has difficulty learning nonlinear correlation. Its similarity measure is also problematic, because the latent space learned by CCA is not directly optimized for a particular distance measure. To address these problems, ICCA introduces locality neighborhood and locality preserving to improve semantic consistency in the latent space, extending CCA from 2-view to 4-view. To learn nonlinear correlation, the 4-view CCA is embedded into the progressive framework. ICCA employs LSSL to overcome the shortcoming of CCA in the similarity metric.

For the supervised scenario, we propose a novel two-stage deep learning method (TSDL). To make full use of the semantics, we conduct supervised learning in both stages. In the first stage, we introduce a reconstruction loss to improve correlation learning.
In the second stage, we build a novel fully-convolutional network (FCN), which is trained under the joint supervision of contrastive loss and center loss. Finally, we present our method for the Microsoft Bing image retrieval challenge. Results show that our method is effective.
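
For readers unfamiliar with the starting point of the thesis, the following is a minimal sketch of standard two-view linear CCA used for image-to-text retrieval. It is not the thesis's 3-view or 4-view formulation, ICCA, or TSDL. The function names `linear_cca` and `retrieve`, the ridge term `reg`, and the use of cosine similarity for ranking are assumptions made for illustration only.

```python
import numpy as np


def linear_cca(X, Y, dim, reg=1e-4):
    """Standard two-view linear CCA (illustrative sketch).

    X: (n, dx) image features, Y: (n, dy) text features, rows are paired.
    Returns projection matrices Wx (dx, dim), Wy (dy, dim) and the top
    `dim` canonical correlations. `reg` is a small ridge term (assumed
    here) that keeps the covariance matrices invertible.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Regularized within-view covariances and the cross-covariance.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Sxx_is = inv_sqrt(Sxx)
    Syy_is = inv_sqrt(Syy)

    # Singular values of the whitened cross-covariance are the
    # canonical correlations, returned in descending order.
    U, s, Vt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is, full_matrices=False)
    Wx = Sxx_is @ U[:, :dim]
    Wy = Syy_is @ Vt.T[:, :dim]
    return Wx, Wy, s[:dim]


def retrieve(query, gallery, top_k=5):
    """Return indices of the top_k gallery rows by cosine similarity."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:top_k]


if __name__ == "__main__":
    # Synthetic paired data: both views are noisy linear maps of a shared code.
    rng = np.random.default_rng(0)
    Z = rng.standard_normal((500, 3))
    X = Z @ rng.standard_normal((3, 10)) + 0.05 * rng.standard_normal((500, 10))
    Y = Z @ rng.standard_normal((3, 8)) + 0.05 * rng.standard_normal((500, 8))

    Wx, Wy, corr = linear_cca(X, Y, dim=3)
    image_latent = (X - X.mean(axis=0)) @ Wx
    text_latent = (Y - Y.mean(axis=0)) @ Wy
    print("canonical correlations:", corr)
    print("texts ranked for image 0:", retrieve(image_latent[0], text_latent))
```

Ranking directly by a fixed similarity in the latent space, as `retrieve` does here, is the baseline scoring scheme that the search-based similarity and LSSL contributions aim to improve on.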
Keywords/Search Tags: CCA, semantic modal, autoencoder, progressive framework, metric learning