Image-Text Cross-Modal Retrieval Method Based On Scene Graph

Posted on: 2023-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: Z W Yang
Full Text: PDF
GTID: 2558307118996329
Subject: Computer Science and Technology
Abstract/Summary:
The surge in active users of new media platforms has produced a persistently high volume of multimedia data, and cross-modal retrieval in multimedia scenarios has attracted interest as a result. Retrieving images with natural language, or retrieving text with images, fits natural human-computer interaction. This mutual retrieval between data of different modalities is called cross-modal retrieval. Because multimodal data is heterogeneous, it is difficult for users to search efficiently and obtain the information they are interested in. Image-text cross-modal retrieval is therefore significant for improving user comfort, product intelligence, and humanization.

This work investigates image-text cross-modal retrieval. The issues are twofold. First, sample bias affects global feature extraction, which ends up driven by the principal object while disregarding other objects. Second, approaches based on local alignment have several drawbacks, including the loss of semantic and attribute information from the original modal data and the difficulty of fully exploiting the overall data, which affects the model's performance. In response, this work examines the goals and issues of image-text cross-modal retrieval and related approaches, with an emphasis on feature balancing and feature measurement in the retrieval process, and proposes several solutions.

First, the core challenge of image-text cross-modal retrieval is insufficient use of information within each modality, together with an imbalance in the amount of information and carrying capacity between modalities. This makes it difficult to align retrieval objects accurately during model training and thus limits model performance. Graph models are one of the main ways to maintain the relationships between objects in different modalities: the objects of a modality become nodes, the relationships between them become edges, and the distance and position information of the original modality is preserved as far as possible. Graph structure makes it easier to enrich feature discrimination and to reduce the intra-class distance of similar items. However, current graph models are too simple to achieve a fair comparison, and with single-modal graph modeling it remains challenging to measure the features of different modalities at the same level. To tackle this problem, this paper first proposes an Auxiliary Bi-level Graph Representation network for image-text cross-modal retrieval (ABGR-Net). The scene graph is employed as an intermediary modality in the original feature-matching architecture, whose feature subspaces are unaligned, and image and text information are modeled simultaneously. To eliminate unnecessary noise and redundant information, the original properties of the scene graph are used to exploit the latent information inside each modality, which serves as a preliminary information extraction and distillation step. Then, a graph convolutional neural network is applied with different fusion strategies to the two types of nodes, mining potential information at two levels: semantics and graph structure. Finally, the enhanced features are used for feature measurement. Extensive experiments on two standard retrieval datasets, MS-COCO and Flickr30k, demonstrate the effectiveness of ABGR-Net. The results show that ABGR-Net improves on the baseline model by 3%-5% on various indicators, and the comprehensive indicator R@sum is improved by up to 20%.

Second, the original similarity calculation in ABGR-Net uses only the maximum similarity score of an image-text pair. Only one possible connection between image and text is explored, which ignores the complicated many-to-many relationships and accompanying semantics between image-text pairs and compromises the integrity of information transmission. This paper proposes two improved similarity measurement methods, the K-Max method and the Soft-Max method, which respectively consider several candidate similarity pairs and the overall similarity. The Soft-Max method fully considers the connection between the probe data and the gallery data, alleviates the incomplete transmission of information, and thereby optimizes the similarity calculation process.

In addition, this paper considers how cross-modal retrieval differs from conventional retrieval, namely the demand for bidirectional (image-to-text and text-to-image) retrieval. The unilateral feature retrieval and calculation procedure in existing cross-modal retrieval methods ignores the requirements of bidirectional retrieval and the possibility of joint optimization. Building on the dual-branch design of the ABGR framework and the conditions of bidirectional retrieval, this paper further proposes a cross-modal retrieval optimization method based on joint ranking. The method performs bidirectional image-text joint ranking by comprehensively considering the similarity scores of the two tasks: it reorganizes the image-text pairs with the highest metric scores in the bidirectional similarity and jointly optimizes the final retrieval result. On the standard image-text retrieval datasets, the retrieval optimization method improves retrieval accuracy by 1%-3% over ABGR-Net, and the comprehensive metric R@sum on the two datasets is enhanced by more than 10%.

Through the above analysis, this paper conducts in-depth research on image-text cross-modal retrieval methods and provides practical solutions to critical problems in the field. A comprehensive comparison with the baseline and state-of-the-art methods demonstrates the effectiveness of the proposed method and shows the importance of a fair comparison of information from different modalities in cross-modal retrieval. The paper makes corresponding improvements in feature extraction, feature enhancement, and feature measurement, improving model performance.
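The abstract names the K-Max and Soft-Max similarity measures without giving their formulas, so the following is only a minimal sketch of one plausible reading. It assumes that an image-text pair is scored from a matrix of pairwise similarities between image-side and text-side graph nodes, that K-Max averages the K largest entries, and that Soft-Max takes a softmax-weighted average of all entries. The function names, the parameters `k` and `temperature`, and the sample matrix are illustrative assumptions, not taken from the thesis.

```python
import math
from typing import List

# Hypothetical input: sim[i][j] is the similarity between image-side node i
# and text-side node j for one image-text pair (an assumption for this sketch).
Matrix = List[List[float]]


def _flatten(sim: Matrix) -> List[float]:
    """Collect every pairwise score into a single list."""
    return [score for row in sim for score in row]


def max_similarity(sim: Matrix) -> float:
    """Baseline described in the abstract: keep only the single best score."""
    return max(_flatten(sim))


def k_max_similarity(sim: Matrix, k: int = 3) -> float:
    """Assumed K-Max reading: average the k largest pairwise scores."""
    scores = sorted(_flatten(sim), reverse=True)
    top = scores[:max(1, k)]  # slicing also handles k larger than the matrix
    return sum(top) / len(top)


def soft_max_similarity(sim: Matrix, temperature: float = 0.1) -> float:
    """Assumed Soft-Max reading: softmax-weighted average over all scores."""
    scores = _flatten(sim)
    peak = max(scores)
    # Subtracting the peak before exponentiating keeps exp() numerically stable.
    weights = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total


if __name__ == "__main__":
    example = [[0.9, 0.1], [0.2, 0.8]]  # made-up scores for illustration
    print("max     :", max_similarity(example))
    print("k-max   :", k_max_similarity(example, k=2))
    print("soft-max:", soft_max_similarity(example, temperature=0.1))
```

Under these assumptions, K-Max with `k=1` reduces to the maximum-score baseline, while a large `temperature` pushes the Soft-Max score toward the plain mean of all entries, so both variants let more than one node pairing contribute to the final score.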
Keywords/Search Tags:Image-Text Retrieval, Cross-Modal, Scene Graph, Graph Convolutional Network, Joint Ranking