
Image-Text Retrieval Based On Hierarchical Interaction Network

Posted on: 2020-10-06  Degree: Master  Type: Thesis
Country: China  Candidate: J Lin  Full Text: PDF
GTID: 2428330590984286  Subject: Engineering
Abstract/Summary:
With the rapid development of the Internet, it has become increasingly common for people to publish text and images on web platforms. It is therefore valuable to design an efficient image-text retrieval method that helps users find relevant content accurately and conveniently in massive collections of image and text data. Recently, with the dramatic progress of deep learning, multi-modal retrieval has attracted extensive attention and various deep-learning-based methods have been proposed. However, a large semantic gap between image and text remains in existing methods: representation-based methods cannot achieve satisfactory performance by simply mapping text and images into a common space and computing their similarity. Extracting and correlating information of different granularities from image and text to reduce this semantic gap has therefore become an extremely challenging issue in the image-text retrieval task.

In this thesis, we adopt an interaction-based approach, which matches word features against visual proposal-region features, and we further improve it by proposing a hierarchical structure and two suppression mechanisms. The main contributions are as follows:

(1) A Hierarchical Interaction Network (HIN) is proposed, which combines hierarchical semantic information with hierarchical attention. The hierarchical semantic information exploits uni-gram information from text and image to build the feature interaction matrix, and additionally makes use of n-gram information to derive richer semantics during the matching of text and image. The hierarchical attention introduces attention mechanisms at the word level (proposal level) and the sentence level (image level) respectively, so that the key information of text and image can be extracted more accurately.

(2) Two boosting mechanisms for suppressing redundant matching are proposed: Proposal Gate and Central Attention. Interaction-based methods match the fine-grained features (uni-gram information) of text and image one by one to reduce the loss of semantic information, but they are prone to forming redundant matches. Proposal Gate exploits a trainable gating threshold to suppress redundant proposal regions that are irrelevant to the matching text. Central Attention predicts the best matching position in the text for a proposal region and then suppresses the surrounding words centered on that position.

(3) Finally, we conduct a series of experiments verifying that the proposed method achieves better image-text retrieval performance on both the Flickr30K and MSCOCO datasets.
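To make the interaction-based matching and the Proposal Gate suppression concrete, the following is a minimal illustrative sketch, not the thesis implementation: word and region features are assumed to be plain vectors, the similarity is cosine similarity, and the gate threshold is a fixed constant here, whereas in the thesis it is trainable. All function names are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def interaction_matrix(word_feats, region_feats):
    # S[i][j] = similarity between word i and proposal region j
    # (the uni-gram feature interaction matrix).
    return [[cosine(w, r) for r in region_feats] for w in word_feats]

def proposal_gate(sim_matrix, threshold=0.2):
    # Zero out proposal regions whose best word match falls below the
    # gate threshold, i.e. regions irrelevant to the matching text.
    n_regions = len(sim_matrix[0])
    kept = [max(row[j] for row in sim_matrix) >= threshold
            for j in range(n_regions)]
    return [[row[j] if kept[j] else 0.0 for j in range(n_regions)]
            for row in sim_matrix]

def match_score(sim_matrix):
    # Sentence-image score: average over words of each word's best region.
    return sum(max(row) for row in sim_matrix) / len(sim_matrix)
```

For example, with two one-hot word vectors and three region vectors, a region that matches no word is suppressed by the gate before the sentence-image score is pooled.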
Keywords: Image-Text Retrieval, Semantic Matching, Multi-Modal, Deep Learning