
Structure Preserving Cross-modal Retrieval Based On Deep Learning

Posted on: 2022-04-28    Degree: Doctor    Type: Dissertation
Country: China    Candidate: P P Kang    Full Text: PDF
GTID: 1488306317994179    Subject: Computer applications engineering
Abstract/Summary:
With the development of multimedia technology, information is presented in increasingly diverse forms, such as images and audio, which are collectively called multi-modal data. At the same time, the popularization of the Internet has turned every user into both a producer and a receiver of information, driving the dissemination and growth of large-scale multi-modal data and making multi-modal research a new trend. Notably, cross-modal retrieval uses a query in one modality to retrieve semantically related information in other modalities, and has become an important research branch in multi-modal learning, pattern recognition, and artificial intelligence. Compared with single-modal retrieval, cross-modal retrieval is more challenging. First, the query and the retrieved data lie in different feature spaces, so their similarity cannot be computed directly; this is the heterogeneous gap. In addition, the original features of each modality can hardly reflect high-level semantic relationships; this is the semantic gap. How to effectively exploit the correlations among multi-modal data to eliminate the heterogeneous gap and the semantic gap is therefore the central difficulty of multi-modal learning.

In recent years, scholars have proposed many cross-modal retrieval methods that transform cross-modal data into a unified common space. Using different techniques, these methods explore various kinds of structural information to promote data fusion and improve discriminative ability, and they have made great progress. With the success of deep neural networks in computer vision, natural language processing, and video analysis, deep learning based cross-modal retrieval methods have achieved remarkable results. However, because the structure of multi-modal data is complex and remains insufficiently explored and utilized, there is still considerable room for performance improvement in cross-modal retrieval. To this end, this paper explores structural information under the deep learning framework and proposes a series of multi-modal fusion algorithms. These algorithms can not only be extended to problems in multi-modal learning, multi-view learning, and domain adaptation, but can also be applied to cross-modal retrieval, social opinion discovery, personalized recommendation, identity authentication, and other fields, so the work has substantial research value and broad practical significance. The main contributions of this paper are summarized as follows.

(1) To reduce the semantic gap and the heterogeneous gap of multi-modal data, a deep semantic space learning model with intra-class low-rank constraint (DSSIL) is proposed. It consists of two subnetworks for modality-specific feature learning and two projection layers that map the feature spaces into a common semantic space, alleviating the semantic gap. Inspired by the idea of low-rank representation, DSSIL constrains cross-modal data of the same category to have a low-rank structure, reducing the heterogeneous gap between modalities. Formally, two regularization terms are devised for these two aspects and incorporated into the objective of DSSIL. In addition, to handle the non-differentiability of the low-rank constraint, an approximate optimization algorithm is proposed for network back-propagation during optimization; a minimal sketch of the underlying constraint is given below.
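To make the intra-class low-rank constraint concrete, the following PyTorch sketch penalizes the nuclear norm of the stacked same-class features from both modalities. The function and tensor names are hypothetical, and relying on autograd through the SVD is an assumption made here for illustration; it is a surrogate for, not a reproduction of, the dissertation's own approximate optimization algorithm.

    import torch

    def intra_class_low_rank_loss(img_feats, txt_feats, labels):
        # For each class, stack that class's projected image and text features
        # into one matrix and penalize its nuclear norm (the sum of singular
        # values), a standard convex surrogate for matrix rank.
        loss = img_feats.new_zeros(())
        for c in labels.unique():
            mask = labels == c
            M = torch.cat([img_feats[mask], txt_feats[mask]], dim=0)
            # torch.linalg.svdvals is differentiable, so autograd supplies a
            # (sub)gradient; DSSIL itself uses a dedicated approximation.
            loss = loss + torch.linalg.svdvals(M).sum()
        return loss

Driving this penalty to zero encourages same-class samples from both modalities to span a common low-dimensional subspace, which is the structural effect the paper attributes to the constraint.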
(2) To address several shortcomings of DSSIL, including its neglect of the low-rank structure within each single modality, its requirement of labeled data for supervised training, and some insufficiently analyzed problems, this paper proposes two deep models: intra-class low-rank regularization for supervised and semi-supervised cross-modal retrieval, denoted ILCMR and Semi-ILCMR respectively. The main idea of ILCMR is to consider both the intra-modal and the inter-modal low-rank structure. In other words, on the basis of DSSIL, a low-rank constraint is imposed on each single modality, so that intra-modal samples of the same category become more similar in the subspace. Moreover, to make full use of both labeled and unlabeled data, Semi-ILCMR is built on the framework of ILCMR. By constraining the low-rank structure of the labeled and unlabeled feature representations, the scarce labeled samples supervise and guide feature learning, while the large number of unlabeled samples enhance the learning process through their low-rank structure.

(3) Owing to the low storage cost and fast retrieval speed of hash representations, a joint semantics preserving hashing method (JSPH) is proposed for cross-modal retrieval. It learns semantically discriminative hash codes and linear hash functions simultaneously. Specifically, by introducing a linear hash function for each modality, samples from different modalities are mapped into the Hamming space, and unified hash codes are obtained for paired cross-modal data. To eliminate the semantic gap and the heterogeneous gap, the ideas of the semantic graph and local structure preservation are introduced. The learning of hash codes and hash functions is then merged into one framework and optimized jointly by a designed alternating algorithm, thereby improving retrieval performance.

(4) Since JSPH has its own limitations, including the limited feature extraction capability of linear hash functions, the inability to preserve semantic structure under batch-wise training, and the difficulty of implementing it with deep networks, this paper proposes a deep fused two-step cross-modal hashing method with multiple semantic supervision (DFTH). DFTH learns unified hash codes for paired cross-modal samples through a fusion network. Semantic label reconstruction and semantic similarity reconstruction are introduced to obtain binary codes that are informative, discriminative, and semantic similarity preserving. Two modality-specific hash networks are then learned under the supervision of common hash code reconstruction, semantic label reconstruction, and intra-modal and inter-modal semantic similarity reconstruction. Besides, to avoid the vanishing gradients caused by binarization, a continuous and differentiable activation function is used to approximate the discrete sgn function, so that the network can back-propagate via automatic gradient computation.

(5) Because DFTH is a supervised method, semantic labels are required to guide model training. However, labeling is labor- and time-intensive, so this paper proposes a pairwise similarity transferring hashing method (PSTH) for unsupervised cross-modal retrieval. The main idea of PSTH is to transfer the data similarity in each original feature space to the Hamming space, thereby learning modality-specific hash codes that preserve the original similarity structure. Furthermore, by maximizing the similarity of paired cross-modal samples instead of artificially constructing a semantic similarity matrix, the semantic similarity structure of cross-modal data is learned automatically and the heterogeneous gap is narrowed; a sketch of these two ingredients follows.
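For concreteness, the following PyTorch sketch illustrates the two ingredients just described: transferring each modality's original similarity structure to its relaxed hash codes, and aligning the codes of paired cross-modal samples. All names are hypothetical, and the tanh relaxation mirrors the smooth approximation of the sgn function mentioned for DFTH; this is one illustrative reading of the idea, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def similarity_transfer_loss(feats, code_logits):
        # Transfer the cosine-similarity structure of the original features
        # to the relaxed hash codes of the same modality.
        Fn = F.normalize(feats, dim=1)
        S = Fn @ Fn.T                         # original pairwise similarities
        B = F.normalize(torch.tanh(code_logits), dim=1)  # tanh ~ smooth sgn(.)
        H = B @ B.T                           # code pairwise similarities
        return F.mse_loss(H, S)

    def paired_alignment_loss(img_logits, txt_logits):
        # Pull the relaxed codes of paired image-text samples together.
        sim = F.cosine_similarity(torch.tanh(img_logits),
                                  torch.tanh(txt_logits), dim=1)
        return (1.0 - sim).mean()

Minimizing the first loss per modality preserves each original similarity structure, while the second loss replaces a hand-crafted cross-modal similarity matrix with direct agreement between paired samples.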
Keywords/Search Tags: Cross-modal retrieval, cross-modal hashing, deep learning, structure preserving, intra-class low-rank