Research on Cross-Modal Entity Alignment for Image and Text Retrieval

Posted on: 2024-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: J S Wei
Full Text: PDF
GTID: 2568307100462254
Subject: Computer technology
Abstract/Summary:
With the arrival of the multimedia era, effective information retrieval technology is urgently needed to manage massive amounts of multimodal data, and image and text retrieval has therefore attracted widespread attention. The task uses data from one modality to retrieve data from another, so that the two modalities complement each other and the available information is used more fully. Image and text retrieval currently faces several difficulties: cross-modal data has semantic gaps, which makes information hard to associate; image and text representations are inconsistent; and dynamic entities are hard to capture, so the dynamic entities described in the text often cannot be matched to image regions.

Entity alignment is an important part of image and text retrieval. It puts the entities described in images and text into correspondence so that heterogeneous cross-modal entities share a unified semantics, and it is the stage that follows feature extraction. Entity alignment maps entities from heterogeneous modalities into a common space, where information can be exchanged and aligned between them. This thesis therefore proposes two entity alignment models to address the problems in current image and text retrieval. The main contributions are as follows:

(1) To address the difficulty of associating information between and within heterogeneous modalities, and the insufficient attention paid to key information, this thesis proposes an image-text alignment model (BSAM) based on BERT and the self-attention mechanism. The model associates information both within each modality and across modalities, aligns the detailed features of cross-modal entities, and achieves full correspondence between the entities described in images and texts. A self-attention mechanism establishes associations within the image modality, while on the text side the Transformer module of the BERT model adaptively extracts contextual information and the associations between words. The model focuses on the detailed features of entities in the image and in the text description, and aligns image and text entities by aligning those features. A cross-attention and similarity attention filtering (CA-SAF) module is introduced in the entity alignment process; it computes all relevant detail features, enhances highly correlated matching pairs, and filters out irrelevant ones, which reduces computational complexity and data redundancy.

(2) To address the difficulty of matching dynamic entities in text descriptions to the relevant dynamic regions of images, this thesis proposes an image-text alignment model (GCAN) based on a gated cyclic attention network. It is mainly used to align action-related entity information in images and text. The model tags the text with parts of speech and feeds it into the BERT model to extract word and sentence features. The verb features are then passed to the Dynamic Entity Capture Unit (DEC), which captures the action regions of entities in the image. The extracted image regions carrying action information are encoded so that they contain contextual information, which achieves dynamic entity alignment. To address the loss of word-order information caused by working with local fragments, global information is introduced to supplement what the local fragments lack and to align all entities in the image and text.

In summary, the main work of this thesis is the proposal of two models, BSAM and GCAN. BSAM aligns the detailed features of entities, which here means all static entities. GCAN aligns dynamic entities in images and text. Since entities comprise both dynamic and static entities, image and text entity alignment is approached from these two aspects, which addresses incomplete image-text alignment and inaccurate matching results. Experiments demonstrate that the method is effective for image and text retrieval and improves retrieval accuracy. In future work, retrieval speed could be improved by compressing the model framework, and the research method of this thesis could be applied to other modalities. Overall, the method offers an effective solution for image and text retrieval and may inform future related research.
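
The cross-attention and filtering idea in contribution (1) can be illustrated with a small sketch. This is not the thesis's CA-SAF module, whose details are not given in the abstract. It assumes that word and region features are plain vectors, that similarity is cosine similarity, that "filtering" means dropping pairs below a fixed threshold, and that the remaining pairs are weighted by a softmax. The function names and the `threshold` and `temperature` parameters are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attention_weights(word, regions, threshold=0.3, temperature=0.1):
    """Return (similarities, weights) of one word over all image regions.

    Pairs whose similarity is not above `threshold` are filtered out (weight 0).
    The surviving pairs share a softmax, so the most similar region dominates.
    """
    sims = [cosine(word, region) for region in regions]
    keep = [s > threshold for s in sims]
    if not any(keep):
        # No region is relevant to this word (assumed behaviour: attend to nothing).
        return sims, [0.0] * len(regions)
    top = max(s for s, k in zip(sims, keep) if k)
    exps = [math.exp((s - top) / temperature) if k else 0.0
            for s, k in zip(sims, keep)]
    total = sum(exps)
    return sims, [e / total for e in exps]


def match_score(words, regions, threshold=0.3, temperature=0.1):
    """Average, over words, of the attention-weighted word-to-region similarity."""
    per_word = []
    for word in words:
        sims, weights = attention_weights(word, regions, threshold, temperature)
        per_word.append(sum(w * s for w, s in zip(weights, sims)))
    return sum(per_word) / len(per_word)


if __name__ == "__main__":
    # Toy 3-dimensional features, invented purely for illustration.
    words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    regions = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(match_score(words, regions))
```

In this sketch a higher `threshold` discards more word-region pairs before the softmax, which is one simple way to reduce redundant comparisons. A learned filter would replace the fixed threshold in a trained model.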
Keywords/Search Tags: entity alignment, deep learning, attention mechanism, image and text retrieval