
Deep Learning For Cross-Modal Retrieval

Posted on: 2016-02-18    Degree: Doctor    Type: Dissertation
Country: China    Candidate: F X Feng    Full Text: PDF
GTID: 1108330482460398    Subject: Computer Science and Technology
Abstract/Summary:
With the development of the mobile Internet, everyone is free to post, transmit and receive messages almost anywhere at any time. These messages usually contain data in multiple modalities, such as text, speech, image and video. The presence of massive multi-modal data on the Internet creates a strong demand for cross-modal retrieval, such as using an image query to search for texts or a text query to search for images. Traditional single-modality information retrieval technology, such as using a text query to search for text, cannot solve the cross-modal retrieval problem well. Therefore, the study of cross-modal retrieval is of great significance for both practical applications and academic research.

In recent years deep learning has made significant progress in a wide variety of fields, including image, speech and natural language processing, and has demonstrated the versatility to deal with data from different sources. Its structural similarity in processing data of different modalities, together with its layer-wise coding capacity, makes deep learning a strong tool for building models for multi-modal information retrieval.

This thesis focuses on cross-modal retrieval tasks between the image and text modalities. Based on an intensive study of multi-modal information retrieval and an extensive analysis of existing research, a series of deep learning models suitable for cross-modal retrieval is presented. These models are evaluated on several publicly available data sets from real scenes. More specifically, the main contributions of the thesis are as follows.

A correspondence autoencoder (Corr-AE) is first proposed, and a deep learning model for cross-modal retrieval is built on top of it. Corr-AE is constructed by correlating the hidden representations of two uni-modal autoencoders. A novel objective function, which minimizes a linear combination of the representation learning error of each modality and the correlation learning error between the hidden representations of the two modalities, is used to train the model as a whole (a rough sketch of this objective is given after the next paragraph). Minimizing the correlation learning error forces only the information common to the two modalities into the hidden representations, while minimizing the representation learning error keeps the hidden representations good enough to reconstruct the input of each modality. Corr-AE is evaluated on three publicly available data sets from real scenes. Experimental results demonstrate that Corr-AE performs significantly better than a model based on canonical correlation analysis and two popular multi-modal deep models on cross-modal retrieval tasks.

Based on Corr-AE, a group of multi-modal reconstruction Corr-AEs and a group of uni-modal reconstruction Corr-AEs are then proposed. In these models the correlation constraint of Corr-AE remains, but the reconstruction parts are redesigned: multi-modal reconstruction Corr-AEs reconstruct both the image and the text modality, while uni-modal reconstruction Corr-AEs reconstruct only one modality, image or text. These models are also evaluated on three publicly available data sets from real scenes. Experimental results demonstrate that the different reconstruction designs not only provide more choices for implementing cross-modal retrieval, but also give a clearer picture of how Corr-AE works.
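To make the Corr-AE objective described above concrete, a minimal sketch follows; the notation and the single trade-off weight are assumptions made for this summary rather than notation fixed in the abstract. For an image-text pair (x, y), let f(x) and g(y) denote the hidden representations of the two uni-modal autoencoders and let \hat{x} and \hat{y} denote their reconstructions. The loss for the pair can then be written as

    L(x, y) = (1 - \alpha)\,\bigl(\lVert x - \hat{x}\rVert^2 + \lVert y - \hat{y}\rVert^2\bigr) + \alpha\,\lVert f(x) - g(y)\rVert^2

where the first term is the representation learning error of each modality, the second term is the correlation learning error between the hidden representations, and \alpha \in [0, 1] controls the trade-off between them.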
A correspondence restricted Boltzmann machine (Corr-RBM) is proposed, and two deep learning models for cross-modal retrieval are built on it. Corr-RBM is constructed from two uni-modal RBMs. Similar to Corr-AE, a correlation constraint is introduced on the representation layers of the two RBMs, and a single objective function trades off the correlation loss against the likelihoods of both modalities. By optimizing this objective, Corr-RBM captures the correlation between the two modalities and learns the representation of each modality simultaneously. Unlike Corr-AE, Corr-RBM introduces two parameters to trade off the likelihoods of the two modalities, reflecting their different importance for learning the common representation space. On top of Corr-RBM, two deep models are built: Corr-DBN and Stacked Corr-RBMs. The former learns the cross-modal correlation only at the topmost layer, while the latter learns it at every layer. These models are also evaluated on three publicly available data sets from real scenes. Experimental results demonstrate that Stacked Corr-RBMs perform significantly better than several state-of-the-art cross-modal retrieval models.

A prototype system for cross-modal retrieval based on the correspondence models proposed in this thesis is designed and developed for retrieval between clothing images and text. The system provides two functions: one returns a list of relevant texts when the user uploads a clothing image; the other returns a list of relevant clothing images when the user enters a few words describing clothing.
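As an illustration of how such a prototype can rank results once a common representation space has been learned, the short sketch below ranks candidates of one modality against a query from the other. The function name, the NumPy implementation and the use of cosine similarity are assumptions made for illustration, not details specified in the abstract.

    import numpy as np

    def rank_by_similarity(query_code, candidate_codes):
        # Normalize the query code and each candidate code, then rank the
        # candidates of the other modality by cosine similarity to the query.
        q = query_code / np.linalg.norm(query_code)
        c = candidate_codes / np.linalg.norm(candidate_codes, axis=1, keepdims=True)
        scores = c @ q              # one cosine score per candidate
        return np.argsort(-scores)  # candidate indices, most similar first

    # Image-to-text search: encode the uploaded clothing image with the image
    # pathway, encode the candidate texts with the text pathway, and call
    # rank_by_similarity(image_code, text_codes) to obtain the relevant text
    # list; text-to-image search swaps the roles of the two pathways.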
Keywords/Search Tags:cross-modal retrieval, deep learning, autoencoder, restricted Boltzmann machine