
Research On Supervised Learning For Cross-modal Retrieval

Posted on: 2021-03-19  Degree: Master  Type: Thesis
Country: China  Candidate: Z Y Wu  Full Text: PDF
GTID: 2428330614963601  Subject: Control engineering
Abstract/Summary:
With the rapid development of cloud computing and big data technology, Internet data is growing explosively. News websites, social platforms, and other applications generate massive amounts of data at all times, in the form of text, images, videos, audio, and so on. In this paper, these different types of data are referred to as multimodal data. In the era of big data, traditional single-modal retrieval methods clearly cannot support mutual retrieval across large, mixed multimodal collections. To retrieve the information matching a user's interest more effectively, fast and accurate cross-modal retrieval methods are needed for multimodal Internet data, and cross-modal retrieval has gradually become a focus for many scholars and researchers.

This paper uses the label information of the data to learn underlying features while ensuring that high-level semantic information is not lost during learning. Data of different modalities are projected into a common feature space through linear or nonlinear mappings. Label information guides feature learning and mines the correlations among data of different modalities. Learning inter- and intra-modal semantic consistency under supervision can fully exploit high-level semantic information and preserve the semantic relevance of the data. After in-depth study of cross-modal retrieval technology, three cross-modal retrieval methods based on supervised learning are proposed:

1. To improve the linear discriminability of features and avoid the error caused by quantizing hash codes, a Supervised Kernel Function for Discrete Cross-modal Hashing, SKFDCH for short, is proposed. The data of each modality are nonlinearly projected into a high-dimensional space by a kernel function, which addresses the linear inseparability of the data in low-dimensional space. Based on the idea of matrix factorization, a latent semantic hashing space is learned for each modality, and data of different dimensions are linearly converted into modality-specific hash codes, preserving the semantic information that the data express in each modality. A semantic affinity matrix is defined by combining the semantic information and label information of each modality, and the similarity of hash codes is modeled in each modality's latent semantic hashing space to obtain hash codes with stronger semantic discrimination. In the optimization stage, a discrete algorithm learns the hash codes directly, which effectively avoids quantization error.
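To make the SKFDCH idea concrete, the following is a minimal sketch with toy dimensions and random data. The thesis's discrete optimization and semantic affinity matrix are stood in for here by a simple ridge regression toward label-derived targets followed by sign quantization, so this illustrates the kernelize-then-hash pipeline rather than the actual algorithm; all names and shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_img, d_txt, n_anchors, n_bits = 500, 128, 64, 32, 16

X_img = rng.standard_normal((n, d_img))        # image features (toy)
X_txt = rng.standard_normal((n, d_txt))        # text features (toy)
Y = rng.integers(0, 2, (n, 10)).astype(float)  # multi-label matrix (toy)

def rbf_kernelize(X, anchors, gamma):
    """Nonlinear lift: phi(x)_j = exp(-gamma * ||x - a_j||^2)."""
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def learn_codes(X, targets, n_anchors, gamma=1e-2, lam=1e-3):
    # Kernelize against randomly sampled anchors, then ridge-regress the
    # kernel features onto relaxed code targets; sign() quantizes to bits
    # (the thesis learns codes discretely instead of relaxing like this).
    anchors = X[rng.choice(len(X), n_anchors, replace=False)]
    K = rbf_kernelize(X, anchors, gamma)
    W = np.linalg.solve(K.T @ K + lam * np.eye(n_anchors), K.T @ targets)
    return np.sign(K @ W)

# Shared label-derived targets keep both modalities' codes semantically
# aligned -- a crude stand-in for the semantic affinity matrix.
T = Y @ rng.standard_normal((10, n_bits))
B_img = learn_codes(X_img, T, n_anchors)
B_txt = learn_codes(X_txt, T, n_anchors)
print("matching bits (mean):", (B_img == B_txt).mean())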
2. To effectively reduce the gap between modalities while retaining modality-specific semantic information, a Modality Consistent Generative Adversarial Network for Cross-modal Retrieval, MCGAN for short, is proposed. Through a generative adversarial network, the cross-modal retrieval task is approximately transformed into a single-modal retrieval task, which effectively retains the semantic information of the image modality (the shared adversarial component is sketched after this abstract). A modality-consistent embedding network projects image features and generated features into a common semantic space, and inter- and intra-modal similarities are modeled with label information. A label classification loss and a category-center loss are defined to train and update the parameters of the overall network. The resulting real-valued features, with strong semantic discrimination, effectively improve the accuracy of mutual retrieval between images and text.

3. To reduce the gap between modalities, fully mine inter- and intra-modal semantic similarity, and reduce storage requirements while limiting the loss of useful information, a Semantic Correlation Generative Adversarial Network for Cross-modal Hashing, SCGAN for short, is proposed. This method combines a generative adversarial network with hashing learning. First, text features are projected into the image feature space through a generative adversarial network, which effectively reduces the difference between the two modalities. Second, the generated features and the real image features are projected into a Hamming space through a semantic correlation hashing network, and hash codes are obtained through a sign function. Finally, label information is used to mine the semantic relevance of the data, and three loss functions aid training: a label classification loss models the similarity of hash features within each modality, a threshold discriminant loss models the similarity between the two modalities, and a hashing metric loss effectively reduces the error introduced when quantizing the hash codes.

Experiments are conducted on the Wikipedia and NUS-WIDE datasets, which are commonly used in cross-modal retrieval research. Comparisons with popular methods under the same experimental settings show that the three proposed methods are effective and feasible.
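The sketch below illustrates the adversarial component shared by MCGAN and SCGAN (methods 2 and 3), under assumed dimensions and a deliberately simplified architecture: a generator maps text features into the image feature space while a discriminator tries to tell generated features from real image features. The thesis's label classification, category-center, threshold discriminant, and hashing metric losses are stood in for by a single MSE consistency term, so this is an illustration of the mechanism, not the actual networks.

```python
import torch
import torch.nn as nn

d_txt, d_img = 64, 128  # assumed feature dimensions

# Generator: text feature space -> image feature space.
G = nn.Sequential(nn.Linear(d_txt, 256), nn.ReLU(), nn.Linear(256, d_img))
# Discriminator: real image feature vs. generated feature (logit output).
D = nn.Sequential(nn.Linear(d_img, 256), nn.ReLU(), nn.Linear(256, 1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

txt = torch.randn(32, d_txt)  # a batch of text features (toy)
img = torch.randn(32, d_img)  # the paired image features (toy)

# Discriminator step: real image features -> 1, generated features -> 0.
fake = G(txt).detach()
loss_d = bce(D(img), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: fool D, and stay close to the paired image feature
# (an MSE stand-in for the semantic-consistency losses in the thesis).
gen = G(txt)
loss_g = bce(D(gen), torch.ones(32, 1)) + nn.functional.mse_loss(gen, img)
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
print(f"loss_d={loss_d.item():.3f}  loss_g={loss_g.item():.3f}")
```

In SCGAN, the generated and real image features would then pass through the hashing network and a sign function to produce binary codes, as described in item 3.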
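For the experiments, the standard protocol in cross-modal hashing is Hamming ranking evaluated with mean average precision (mAP). The toy sketch below, with random codes and single-label toy ground truth, shows how such an evaluation is typically computed; it is a generic illustration, not the thesis's evaluation code.

```python
import numpy as np

rng = np.random.default_rng(1)
B_query = np.sign(rng.standard_normal((20, 16)))  # e.g. image query codes
B_db = np.sign(rng.standard_normal((100, 16)))    # e.g. text database codes
L_query = rng.integers(0, 5, 20)                  # toy single labels
L_db = rng.integers(0, 5, 100)

def mean_average_precision(Bq, Bd, Lq, Ld):
    aps = []
    for b, l in zip(Bq, Lq):
        ham = (b != Bd).sum(axis=1)           # Hamming distance to each item
        order = np.argsort(ham)               # rank database by distance
        rel = (Ld[order] == l).astype(float)  # relevance of ranked items
        if rel.sum() == 0:
            continue
        prec = np.cumsum(rel) / np.arange(1, len(rel) + 1)
        aps.append((prec * rel).sum() / rel.sum())  # average precision
    return float(np.mean(aps))

print("toy mAP:", mean_average_precision(B_query, B_db, L_query, L_db))
```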
Keywords/Search Tags: Cross-modal retrieval, Supervised learning, Hashing learning, Deep learning, Generative adversarial networks