
Research On Unsupervised Cross-modal Retrieval Based On Pairwise Similarity Information

Posted on: 2022-01-11  Degree: Master  Type: Thesis
Country: China  Candidate: Y Zhao  Full Text: PDF
GTID: 2568307070453094  Subject: Software engineering
Abstract/Summary:
In recent years, with the rapid growth of multimodal data, cross-modal retrieval has attracted much attention. The cross-modal retrieval task can be understood as using one type of data as the query to retrieve related data of another type; for example, a user can use text to retrieve related images or videos. Since the query and the retrieval results do not belong to the same modality, the main challenges are how to measure content similarity between data of different modalities and how to map multimodal data lying on different manifolds into the same common latent space.

This thesis first reviews representative cross-modal retrieval methods and divides them into two categories: real-valued representation learning and binary representation learning. Real-valued methods aim to learn real-valued shared representations of the different modalities, while many binary methods map data of different modalities into a shared Hamming space to speed up retrieval. According to the information used when learning the common representation, cross-modal retrieval methods can be further divided into 1) unsupervised methods and 2) supervised methods. Generally speaking, the more information a method uses, the better its performance: supervised methods often achieve better retrieval results because labels provide more information, but in practice a large number of manual annotations must be collected, and label information is sometimes expensive to obtain. On the other hand, it is difficult to achieve satisfactory results without using any supervisory information at all. Pairwise similarity information, which is far easier to obtain than labels, has therefore become a major focus of recent research.

This thesis studies cross-modal retrieval based on pairwise similarity information and aims to improve retrieval accuracy while saving storage space. The technique currently has the following
three main problems to be solved:

(1) Extraction of modality information and mining of neighbor information. Lacking label information, unsupervised cross-modal retrieval algorithms cannot recover the complete original neighborhood structure, and their performance is weaker than that of supervised algorithms. Although label information is difficult to obtain, pairwise information is easy to collect during training, and its reasonable use can greatly improve model performance. A suitable network model is therefore needed to mine the value of pairwise similarity information and extract effective neighborhood information.

(2) Inconsistency of similarity information across modalities. Because the data of each modality lies on a manifold of a different dimension after passing through its feature-extraction network, the similarity between data of different modalities cannot be computed directly, and the data distributions on the different manifolds are also inconsistent. Without label information, the neighbor information extracted from each modality's manifold lacks a unified metric and differs between modalities, so a network model is needed to obtain a unified neighborhood relation.

(3) Rational use of global and local similarity information. Similarity information computed over the entire data set can be called global neighborhood information. Deep models are commonly optimized with mini-batch gradient descent, so the data must be fed to the network in batches, and the sparseness of the global neighborhood information is amplified within a small batch. Local neighborhood information must therefore be further exploited, and the two kinds of information combined reasonably, to further improve retrieval performance.

Based on the
above three issues, the main contributions of this thesis are as follows:

(1) A fine-grained cross-modal retrieval algorithm that learns isomorphic and heterogeneous information. Based on the phenomenon of "modal co-occurrence", the algorithm aligns data of different modalities: it first uses Faster R-CNN combined with an attention mechanism to obtain multiple candidate objects from an image, uses a bidirectional GRU to build word vectors for the words in the text, and then maps the candidate objects and word vectors into the same common latent space. Dense graphs are constructed within each modality to preserve neighbor information, and a proposed "two-step" algorithm learns the isomorphic and heterogeneous information in the dense graphs through a self-attention mechanism. In this way, a sample in the latent space can repeatedly learn the characteristics of other samples that co-occur with it, while the pairwise information constrains the semantically closest samples of different modalities to co-occur; this is the "modal co-occurrence" phenomenon.

(2) A cross-modal hash retrieval algorithm based on a variational auto-encoder. The algorithm first uses the variational auto-encoder to embed samples into a common latent space. Under the assumption that the features in this shared space obey a multivariate Gaussian distribution, it simultaneously updates the mean vectors and the cluster centroids of the latent features by minimizing a clustering loss, reducing the distance between each cluster centroid and its mean vectors and making the clusters more compact. A reconstruction loss is also introduced to improve performance, and experiments on multiple data sets demonstrate the algorithm's effectiveness.

(3) A cross-modal hash retrieval algorithm that combines global sparse approximation information and local dense
approximation information. The algorithm first uses the kNN algorithm to construct a sparse global neighbor graph within each modality, then uses a multi-modal graph-clustering algorithm to obtain unified semantic neighbor information, and finally combines this with the local dense neighbor graphs constructed within each mini-batch to obtain the final similarity relation, which guides the training of the hash functions.
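The alternating update at the heart of contribution (2), assigning latent mean vectors to centroids and then moving the centroids to tighten the clusters, can be sketched in NumPy. This is a minimal illustration on synthetic data, not the thesis's actual variational auto-encoder: the latent means `mu`, the number of clusters, and the initialization are all toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the mean vectors a variational auto-encoder would
# produce in the shared latent space (two well-separated blobs).
mu = np.vstack([rng.normal(0.0, 0.1, size=(20, 4)),
                rng.normal(3.0, 0.1, size=(20, 4))])

def clustering_loss(mu, centroids):
    """Mean squared distance from each latent mean to its nearest centroid."""
    d2 = ((mu[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)                  # nearest-centroid assignment
    return d2[np.arange(len(mu)), assign].mean(), assign

# Deterministic init: one sample from each blob (an assumption for this demo).
centroids = np.stack([mu[0], mu[-1]])
loss0, _ = clustering_loss(mu, centroids)

# Alternate assignment and centroid updates, shrinking the distance between
# cluster centroids and mean vectors as the abstract describes.
for _ in range(5):
    _, assign = clustering_loss(mu, centroids)
    for c in range(2):
        if (assign == c).any():
            centroids[c] = mu[assign == c].mean(axis=0)

final_loss, _ = clustering_loss(mu, centroids)
```

In the thesis's setting this clustering term would be minimized jointly with the auto-encoder's reconstruction loss; only the clustering step is shown here.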
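The fusion in contribution (3), a sparse global kNN graph combined with a dense in-batch graph, can likewise be sketched. The cosine similarity, the neighbor count k, and the fusion weight alpha are illustrative assumptions, and the thesis's multi-modal graph-clustering step is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
feats = rng.normal(size=(12, 6))   # hypothetical features for one modality

def knn_graph(x, k):
    """Sparse global neighbor graph: row i marks the k nearest neighbors
    of sample i under cosine similarity (no self-loops)."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = xn @ xn.T
    np.fill_diagonal(sim, -np.inf)
    S = np.zeros(sim.shape)
    nearest = np.argsort(-sim, axis=1)[:, :k]
    S[np.arange(len(x))[:, None], nearest] = 1.0
    return S

global_S = knn_graph(feats, k=3)

# Inside a mini-batch the local graph is the full (dense) cosine-similarity
# matrix; a weighted sum fuses it with the sparse global relations.
batch = np.arange(4)                  # first four samples as one mini-batch
xb = feats[batch] / np.linalg.norm(feats[batch], axis=1, keepdims=True)
local_S = xb @ xb.T
alpha = 0.5                           # fusion weight (assumed, not from the text)
fused = alpha * global_S[np.ix_(batch, batch)] + (1 - alpha) * local_S
```

The fused matrix would then serve as the similarity relation supervising the hash-function training.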
Keywords/Search Tags: unsupervised learning, sparse graph, cross-modal retrieval, hash algorithm, pairwise similarity information