
Image-Text Retrieval Based On Hierarchical Interaction Network

Posted on: 2020-10-06  Degree: Master  Type: Thesis
Country: China  Candidate: J Lin  Full Text: PDF
GTID: 2428330590984286  Subject: Engineering
Abstract/Summary:
With the rapid development of the Internet, it has become increasingly common for people to publish text and images on web platforms. It is therefore valuable to design an efficient image-text retrieval method that helps users find relevant content accurately and conveniently in massive collections of image and text data. Recently, with the dramatic progress of deep learning, multi-modal retrieval has attracted extensive attention and various deep-learning-based methods have been proposed. However, a large semantic gap between image and text remains in existing methods: representation-based methods cannot achieve satisfactory performance by simply mapping text and images into a common space and computing their similarity. Extracting and correlating information of different granularities from image and text to reduce this semantic gap has therefore become an extremely challenging issue in the image-text retrieval task.

In this thesis, we adopt an interaction-based approach, which matches word features against visual proposal-region features, and we further improve it by proposing a hierarchical structure and two suppression mechanisms. The main contributions are as follows:

(1) A Hierarchical Interaction Network (HIN) is proposed, which combines hierarchical semantic information with hierarchical attention. The hierarchical semantic information exploits uni-gram information from text and image to build the feature interaction matrix, and additionally makes use of n-gram information to derive richer semantics during the matching of text and image. The hierarchical attention introduces attention mechanisms at the word level (proposal level) and the sentence level (image level) respectively, so that the key information of text and image can be extracted more accurately.

(2) Two boosting mechanisms for suppressing redundant matching are proposed: Proposal Gate and Central Attention. Interaction-based methods match the fine-grained features (uni-gram information) of text and image one by one to reduce the loss of semantic information, but they are prone to forming redundant matches. Proposal Gate exploits a trainable gating threshold to suppress redundant proposal regions that are irrelevant to the matching text. Central Attention predicts the best matching position in the text for a proposal region and then suppresses the surrounding words centered on that position.

(3) Finally, we conduct a series of experiments verifying that the proposed method achieves better image-text retrieval performance on both the Flickr30K and MSCOCO datasets.
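To make the interaction-based matching and the Proposal Gate suppression concrete, the following is a minimal illustrative sketch, not the thesis implementation: word and region features are assumed to be plain vectors, the similarity is cosine similarity, and the gate threshold is a fixed constant here, whereas in the thesis it is trainable. All function names are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def interaction_matrix(word_feats, region_feats):
    # S[i][j] = similarity between word i and proposal region j
    # (the uni-gram feature interaction matrix).
    return [[cosine(w, r) for r in region_feats] for w in word_feats]

def proposal_gate(sim_matrix, threshold=0.2):
    # Zero out proposal regions whose best word match falls below the
    # gate threshold, i.e. regions irrelevant to the matching text.
    n_regions = len(sim_matrix[0])
    kept = [max(row[j] for row in sim_matrix) >= threshold
            for j in range(n_regions)]
    return [[row[j] if kept[j] else 0.0 for j in range(n_regions)]
            for row in sim_matrix]

def match_score(sim_matrix):
    # Sentence-image score: average over words of each word's best region.
    return sum(max(row) for row in sim_matrix) / len(sim_matrix)
```

For example, with two one-hot word vectors and three region vectors, a region that matches no word is suppressed by the gate before the sentence-image score is pooled.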
Keywords: Image-Text Retrieval, Semantic Matching, Multi-Modal, Deep Learning