
Research On Relevance Computation Of Cross-modal Retrieval

Posted on: 2019-09-13    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J F Dong    Full Text: PDF
GTID: 1368330548977388    Subject: Computer Science and Technology
Abstract/Summary:
With the rapid advancement of technologies such as the Internet, smartphones, social media, and instant messaging, people can readily create and share multimedia data in various modalities (text, images, videos, etc.) anytime and anywhere. Facing this deluge of multimedia data, efficiently and effectively retrieving the data that users need or are interested in is of great practical value. Cross-modal retrieval allows the query and the retrieved objects to be of different modalities, enabling mutual retrieval between data of different modalities, for example retrieving text with an image query or retrieving images with a text query. Because of this flexibility, cross-modal retrieval better meets users' requirements and has become a hot topic in multimedia retrieval.

For a given query, cross-modal retrieval techniques rank the candidate objects by their relevance to the query and thereby produce the final search result. The key to cross-modal retrieval is therefore computing the cross-modal relevance between multimedia data of different modalities. Since the underlying features of data from different modalities are heterogeneous and not directly comparable (the so-called heterogeneity gap), computing cross-modal relevance is a great challenge.

Given these challenges, this dissertation focuses on the three most common data types, text, image, and video, and conducts in-depth research on text-image and text-video cross-modal retrieval. We compute cross-modal relevance from the perspectives of cross-modal data representation and common-space selection, propose a series of cross-modal retrieval models, and verify their viability on several public benchmarks. We also systematically evaluate the current mainstream cross-modal retrieval models, exposing their advantages and disadvantages, and propose a cross-modal relevance fusion framework that further improves performance and robustness.
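As a concrete picture of the relevance computation described above: once a query and its candidates are embedded in a shared space, retrieval reduces to ranking candidates by a similarity score. The minimal sketch below assumes cosine similarity and already-encoded vectors; the encoders and the choice of similarity are placeholders, not the specific models proposed in this dissertation.

```python
import numpy as np

def rank_by_relevance(query_vec, candidate_vecs):
    """Rank candidates by cosine similarity to the query in a common space.

    query_vec: (dim,) vector of the query (e.g. an encoded text query).
    candidate_vecs: (num_candidates, dim) matrix of encoded candidates
    (e.g. image or video features) in the same space.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q                   # cosine similarity per candidate
    order = np.argsort(-scores)      # indices sorted by descending relevance
    return order, scores[order]
```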
The main innovations and contributions of this dissertation are summarized as follows:

1. Existing cross-modal retrieval models mainly rely on overall semantic information. Salient information, which has received wide attention in image analysis and single-modal image retrieval, has not yet been explored for cross-modal retrieval. This dissertation explores the salient information of text and video data and proposes a feature representation method that captures both the overall semantic information and the salient information of the input data. The proposed representation can be readily plugged into both common-space-learning and similarity-metric-based cross-modal retrieval models, demonstrating good generality (a hypothetical pooling sketch follows this list). The experimental results also show that the extra salient information is helpful for cross-modal retrieval.

2. Mainstream common-space-learning methods map data from different modalities into a latent subspace. However, the latent subspace lacks a clear physical interpretation and requires two mappings to compute cross-modal relevance. This dissertation proposes to use the deep visual feature space directly as the common space, so that only a single one-way mapping is needed to compute relevance. Concretely, we propose a deep neural network that predicts visual features from text input, so that text and visual objects can both be represented in the deep visual space for cross-modal retrieval (a minimal sketch appears after this list). The proposed model is applied to image- and video-related cross-modal retrieval and outperforms mainstream latent-subspace approaches on four mainstream public datasets, demonstrating the feasibility of choosing the deep visual space as the common space.

3. Although a large number of cross-modal retrieval models have been proposed, most of them are evaluated only in experimental settings, and their performance in real applications is unclear. Such evaluations do not help us understand the models well and thus hinder their improvement. Based on large-scale query logs from a commercial search engine, this dissertation systematically evaluates mainstream text-image cross-modal retrieval models. Specifically, we propose a simple matching-based baseline that helps expose the nature of complex advanced models by comparison, and we further evaluate the models with robustness analysis and statistical significance tests. Moreover, we categorize textual queries by introducing the concept of query visualness, which enables a more detailed analysis of the retrieval results and a better understanding of each model's advantages and limitations.

4. Different features and different cross-modal retrieval methods usually have their own mechanisms, advantages, and limitations, so different methods with different features may be complementary. This dissertation systematically studies the characteristics and performance of feature fusion and method fusion, and proposes a cross-modal relevance fusion framework. The framework supports fusing arbitrary cross-modal retrieval methods and scales well (a score-level fusion sketch appears after this list). The experimental results show that the fusion framework improves not only the performance of cross-modal retrieval but also its robustness.

5. We build a cross-modal retrieval demo system based on the proposed cross-modal relevance fusion framework and implement the retrieval methods proposed above on it, showing that these methods are practicable in real-world cross-media retrieval applications.
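For contribution 1, the abstract does not spell out how overall and salient information are combined. One plausible, purely hypothetical instantiation pools word (or frame) embeddings in two ways, mean pooling for the overall semantics and max pooling for per-dimension salient activations, and concatenates the results; the dissertation's actual mechanism may differ.

```python
import numpy as np

def overall_plus_salient(embeddings):
    """Hypothetical sketch of a representation with overall + salient parts.

    embeddings: (num_words_or_frames, dim) array of word or frame vectors.
    Mean pooling summarizes the overall semantics; max pooling keeps the
    most salient activation in each dimension.
    """
    overall = embeddings.mean(axis=0)
    salient = embeddings.max(axis=0)
    return np.concatenate([overall, salient])   # shape: (2 * dim,)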
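For contribution 2, the core idea is a one-way mapping from text into the deep visual feature space. The PyTorch-style sketch below assumes a fixed-length text encoding and a CNN visual feature space of dimension `visual_dim`; the layer sizes, regression loss, and class name `Text2VisualNet` are illustrative assumptions rather than the dissertation's exact architecture.

```python
import torch
import torch.nn as nn

class Text2VisualNet(nn.Module):
    """Sketch: predict a deep visual feature vector from a text encoding."""
    def __init__(self, text_dim, hidden_dim, visual_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, visual_dim),
        )

    def forward(self, text_vec):
        return self.mlp(text_vec)

# Training pairs: (text encoding, CNN feature of the matching image/video).
# A regression loss pulls the predicted vector towards the target visual
# feature; at retrieval time, candidates are ranked by their similarity to
# the predicted vector, so only this one-way text-to-visual mapping is needed.
model = Text2VisualNet(text_dim=300, hidden_dim=1024, visual_dim=2048)
loss_fn = nn.MSELoss()
```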
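For contribution 4, one common way to realize method fusion is late (score-level) fusion: normalize the relevance scores produced by each retrieval method and combine them with per-method weights. The sketch below shows this scheme under the assumption of equal or user-supplied weights; the dissertation's framework may also learn the weights or fuse at the feature level.

```python
def fuse_relevance(score_lists, weights=None):
    """Weighted late fusion of relevance scores from several retrieval methods.

    score_lists: one list of scores per method, aligned over the same
    candidates. Scores are min-max normalized so methods are comparable.
    """
    def normalize(scores):
        lo, hi = min(scores), max(scores)
        return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

    normalized = [normalize(s) for s in score_lists]
    weights = weights or [1.0 / len(score_lists)] * len(score_lists)
    return [sum(w * s[i] for w, s in zip(weights, normalized))
            for i in range(len(score_lists[0]))]
```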
Keywords/Search Tags:Cross-modal Retrieval, Multimedia Retrieval, Relevance Computation, Feature Representation, Common Space, Latent Subspace, Query Visualness, Salient Information, Deep Learning