With the rapid development of Internet technology, people create and share an ever-growing variety of multimedia material (e.g., images and texts) in their daily lives. Faced with the explosively growing amount of multi-modal data on the Internet, traditional single-modal retrieval has gradually become unable to meet users' needs. How to enable computers to perform cross-modal retrieval by understanding the associations among multi-modal data has become a critical research topic in the multimedia understanding community. Therefore, by exploiting semantics and commonsense knowledge, this thesis addresses the task of cross-modal image-text retrieval. Briefly, the main contributions of this thesis can be summarized as follows:

(1) We propose a Stacked Multi-modal Attention Network (SMAN) for cross-modal image-text retrieval, which makes use of a stacked attention mechanism to exploit the hierarchical fine-grained interdependencies between image and text. Specifically, we jointly employ intra-modal and multi-modal information as query guidance to perform multi-step attention reasoning, so that the multi-level fine-grained correlation between image and text can be captured. Moreover, we present a novel bi-directional ranking loss, which imposes a distance constraint on pairwise multi-modal instances to preserve the manifold structure of the multi-modal data distribution in the joint embedding space.

(2) We propose a Stacked Squeeze-and-Excitation Recurrent Residual Network (SER²-Net) for cross-modal image-text retrieval. First, an efficient multi-level representation module, built by combining multiple semantic enhancing operations, is presented to produce a series of semantically discriminative features. Besides, to capture the implicit correlations among multi-level features, we propose a novel objective, namely the Cross-modal Semantic Discrepancy (CMSD) loss, which exploits the interdependency among different semantic levels to narrow the distribution discrepancy between heterogeneous data.

(3) We propose a Consensus-aware Visual-Semantic Embedding (CVSE) model for cross-modal image-text retrieval. Specifically, consensus information is exploited by computing the statistical co-occurrence correlations between semantic concepts in the image captioning corpus and feeding the constructed concept correlation graph into a graph convolution network to yield consensus-aware concept (CAC) representations. Afterwards, CVSE can simultaneously pinpoint the high-level concepts and generate unified consensus-aware concept representations for both modalities, thereby achieving more precise semantic alignment between image and text.

(4) We propose a Commonsense Aided Visual-semantic Embedding (COVE) model for cross-modal image-text retrieval. By combining knowledge graphs, statistical relations, and graph neural networks, both logical commonsense knowledge and statistical commonsense knowledge can be captured simultaneously, so that multiple kinds of commonsense knowledge are introduced into the learning of cross-modal representations.

Extensive experiments on two public benchmark datasets, i.e., Flickr30k and MSCOCO, demonstrate the superiority of all four proposed approaches.
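The multi-step attention reasoning described in contribution (1) can be illustrated with a minimal sketch: a query vector repeatedly attends over image region features and is refined with the attended context at each step. This is a generic stacked-attention loop, not the exact SMAN architecture; the function name, the additive refinement rule, and the number of steps are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def stacked_attention(query, regions, steps=2):
    """Multi-step attention reasoning (illustrative sketch):
    at each step the query attends to the image region features,
    and the attended context refines the query, so later steps
    can focus on progressively finer-grained correlations."""
    for _ in range(steps):
        scores = regions @ query          # (num_regions,) relevance scores
        weights = softmax(scores)         # attention distribution over regions
        context = weights @ regions       # attended region feature
        query = query + context           # refine the query for the next step
    return query
```

In practice each step would use learned projections for queries and keys; the plain dot-product form above only conveys the iterative refinement idea.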
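The bi-directional ranking loss of contribution (1) builds on the standard hinge-based objective used in visual-semantic embedding: for a batch of matched image-text pairs, every non-matching caption is a negative for an image and vice versa. The sketch below shows that standard bi-directional hinge loss over a similarity matrix; the margin value is an assumption, and the thesis's additional manifold-preserving distance constraint is not reproduced here.

```python
import numpy as np

def bidirectional_ranking_loss(sim, margin=0.2):
    """Hinge-based bi-directional ranking loss over an N x N
    image-text similarity matrix whose diagonal holds the
    matched (positive) pairs. Illustrative sketch only."""
    n = sim.shape[0]
    pos = np.diag(sim).reshape(n, 1)               # s(image_i, text_i)
    # image -> text direction: captions j != i are negatives for image i
    cost_i2t = np.maximum(0.0, margin + sim - pos)
    # text -> image direction: images j != i are negatives for caption i
    cost_t2i = np.maximum(0.0, margin + sim - pos.T)
    mask = 1.0 - np.eye(n)                         # exclude the positive pairs
    return float(((cost_i2t + cost_t2i) * mask).sum() / n)
```

When every positive pair outscores all negatives by at least the margin, the loss is zero, which is the training target of such ranking objectives.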
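The consensus exploitation in contribution (3) can likewise be sketched: count how often concepts co-occur in a caption corpus, normalize the counts into a correlation graph, and propagate concept embeddings through a graph-convolution step. The toy corpus, vocabulary, embedding size, and single-layer propagation below are all illustrative assumptions, not the CVSE configuration.

```python
import numpy as np

# Toy caption corpus: each caption is tagged with its semantic concepts.
captions = [
    {"dog", "grass"},
    {"dog", "ball", "grass"},
    {"cat", "sofa"},
]
vocab = sorted({c for cap in captions for c in cap})
idx = {c: i for i, c in enumerate(vocab)}
n = len(vocab)

# Statistical co-occurrence counts between pairs of concepts.
co = np.zeros((n, n))
for cap in captions:
    for a in cap:
        for b in cap:
            if a != b:
                co[idx[a], idx[b]] += 1

# Row-normalize counts into conditional co-occurrence probabilities,
# add self-loops, and run one graph-convolution step H' = ReLU(A_hat @ H @ W)
# to yield consensus-aware concept representations.
row_sums = co.sum(axis=1, keepdims=True)
adj = np.divide(co, row_sums, out=np.zeros_like(co), where=row_sums > 0)
a_hat = adj + np.eye(n)
rng = np.random.default_rng(0)
h = rng.normal(size=(n, 8))                  # initial concept embeddings
w = rng.normal(size=(8, 8))                  # learnable projection (random here)
h_next = np.maximum(0.0, a_hat @ h @ w)      # consensus-aware concept features
```

The resulting concept features can then serve as a shared vocabulary for representing both images and texts, which is the role the CAC representations play in the model.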