
Research On Image And Text Retrieval Based On Attention Mechanism

Posted on: 2022-01-22    Degree: Master    Type: Thesis
Country: China    Candidate: Y J Shi    Full Text: PDF
GTID: 2518306605471594    Subject: Circuits and Systems
Abstract/Summary:
With the advent of the information age, all kinds of data have grown explosively on the Internet. These data are updated quickly and come in a rich variety, which greatly increases the difficulty of information retrieval. In recent years, the rapid rise of artificial intelligence technology, represented by data mining and deep learning, has brought hope for the rapid development of information retrieval technology. Meanwhile, how to enable cross-modal retrieval, such as using text to search for relevant images or videos, has become a hot topic at home and abroad. The main difficulties of this research are as follows: first, the heterogeneity of multimodal data results in a semantic gap, which makes it difficult to retrieve information across modalities; second, cross-modal retrieval requires understanding not only language semantics and visual content, but also cross-modal relationships and alignment. To address these issues, this thesis proposes and implements text-image cross-modal retrieval based on the attention mechanism.

Firstly, this thesis proposes a bi-directional attention network for image-text retrieval. The algorithm extracts global representations for the image and the sentence, and devises a bi-directional attention module that models the interactions between the two modalities before calculating similarities in the joint space. This enables message passing across modalities and effectively weakens the heterogeneity of the data. The attended features of each modality serve as auxiliary information for the other, so the two modalities complement each other and the model learns discriminative deep features. Extensive experiments demonstrate that the proposed algorithm outperforms traditional global-correspondence methods, and show the effectiveness and necessity of exploring the interactions between the visual and textual modalities for text-image retrieval.

Next, this thesis presents a
fine-grained feature alignment model for image-text retrieval. The algorithm uses a bottom-up attention model to extract region-level visual features and a bi-directional Gated Recurrent Unit to extract word-level sentence features, and introduces fine-grained cross attention to discover all latent visual-semantic alignments. In addition, to further strengthen the association within matched pairs and minimize their feature distance, the algorithm introduces an angular margin loss that projects the feature vectors into an angular space. Extensive experiments show that, compared with coarse matching using global information, fine-grained feature alignment can effectively improve retrieval accuracy and make image-text matching interpretable.

Finally, this thesis presents a multi-modality graph-structured network for image-text retrieval. Exploiting the effectiveness of graph neural networks in modeling relations among different nodes and learning powerful node representations, the algorithm devises three independent graphs to capture different relations among image regions and words: a visual graph and a textual graph model intra-modality relationships using a self-attention module, while a cross-modality graph models inter-modality relationships using a mask-attention module. This remedies the limitations of single-modality information in the cross-modal matching task and enhances the image and text representations. Extensive experiments show that the inter-modality relationship is as important as the intra-modality relationship in cross-modal retrieval.

The image-text cross-modal retrieval algorithms studied and implemented in this thesis effectively address the data-analysis and semantic-gap problems caused by data heterogeneity, and have practical value for semantic information mining in other multimodal tasks.
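The first contribution, bi-directional attention between the two modalities before similarity is computed in the joint space, can be sketched as follows. This is a minimal NumPy illustration, not the thesis's exact architecture: the feature shapes, the dot-product affinity, and the mean-pooling into the joint space are all assumed choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_attention(img_regions, words):
    """One round of cross-modal message passing.
    img_regions: (n_regions, d) visual features; words: (n_words, d)
    textual features. Each modality is re-weighted by its affinity
    with the other, so information flows in both directions."""
    affinity = img_regions @ words.T                     # (n_regions, n_words)
    img_ctx = softmax(affinity, axis=1) @ words          # image attends to text
    txt_ctx = softmax(affinity.T, axis=1) @ img_regions  # text attends to image
    return img_ctx, txt_ctx

def joint_similarity(img_regions, words):
    """Pool the attended features and compare them in the joint space."""
    img_ctx, txt_ctx = bidirectional_attention(img_regions, words)
    v = img_ctx.mean(axis=0)
    t = txt_ctx.mean(axis=0)
    return float(v @ t / (np.linalg.norm(v) * np.linalg.norm(t)))
```

Because each modality's context vector is built from the other modality's features, the two representations supplement each other before matching, which is the message-passing effect the abstract describes.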
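The second contribution pairs fine-grained word-region cross attention with an angular margin loss. A possible reading of both pieces is sketched below; the attend-then-average scoring and the margin value 0.2 are illustrative assumptions, not the thesis's reported settings.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_region_alignment(regions, words):
    """Fine-grained alignment: each word attends over the image regions,
    and the image-sentence score is the mean cosine between each word
    and its attended visual context."""
    attn = softmax(words @ regions.T, axis=1)   # (n_words, n_regions)
    ctx = attn @ regions                        # attended region mix per word
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    c = ctx / np.linalg.norm(ctx, axis=1, keepdims=True)
    return float((w * c).sum(axis=1).mean())

def angular_margin_score(cos_sim, margin=0.2):
    """Additive angular margin: widen the matched pair's angle by
    `margin` before re-taking the cosine, so the training objective
    forces matched pairs to be even closer in angular space."""
    theta = np.arccos(np.clip(cos_sim, -1.0, 1.0))
    return float(np.cos(theta + margin))
```

The margin makes a matched pair's score strictly harder to earn (`angular_margin_score(s) < s` for any `s < 1`), which is what strengthens the association within matched pairs during training.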
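The third contribution's mask-attention module can be illustrated as attention restricted by a graph's adjacency matrix. In this sketch (an assumed simplification: single-head, no learned projections), the cross-modality graph connects only region-word pairs, so every node update necessarily mixes in the other modality.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(nodes, adj):
    """Scaled dot-product attention restricted to the edges in `adj`
    (a 0/1 adjacency matrix): disconnected pairs are masked to -inf,
    so each node aggregates only from its graph neighbors."""
    d = nodes.shape[1]
    scores = nodes @ nodes.T / np.sqrt(d)
    scores = np.where(adj > 0, scores, -np.inf)
    return softmax(scores, axis=1) @ nodes

def cross_modal_graph(regions, words):
    """Cross-modality graph: region and word nodes share one graph, but
    edges only connect nodes from different modalities, so every
    updated representation is built from the other modality."""
    n_r, n_w = len(regions), len(words)
    nodes = np.vstack([regions, words])
    adj = np.zeros((n_r + n_w, n_r + n_w))
    adj[:n_r, n_r:] = 1   # region -> word edges
    adj[n_r:, :n_r] = 1   # word -> region edges
    return graph_attention(nodes, adj)
```

With an all-ones adjacency matrix instead, the same `graph_attention` function plays the role of the self-attention module used by the intra-modality visual and textual graphs.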
Keywords/Search Tags:Attention mechanism, Cross-modal retrieval, Deep Learning, Feature embedding