Cross-modal image-text retrieval establishes connections between images and text, enabling information interaction and sharing across the two modalities; its core goal is to bridge the heterogeneous gap between them. Mainstream approaches usually detect image regions first and then associate those regions with text words. However, these methods focus mainly on the features of salient image regions, rarely consider the correlations between different regions, and fail to make full use of image region attribute labels. As a result, they cannot capture finer-grained semantic information in images and texts, which limits retrieval accuracy.

To address these problems, this paper proposes a cross-modal image-text retrieval method based on intra-modal feature enhancement, which further explores the deep intra-modal and inter-modal relationships between the image and text modalities. The method consists of three main parts: image feature extraction and processing, text feature extraction and processing, and cross-modal interaction. First, salient regions in the image are detected, the salient region features and the region attribute label features are extracted, and the two are fused to obtain a comprehensive image representation; a graph convolutional network then models the relationships among the fused image regions to achieve semantic enhancement of the salient regions. Next, text word representations are obtained using BERT and a Bi-GRU. Finally, the image and text features interact across modalities, and a stacked cross-attention mechanism infers the similarity between them. Experimental results on two benchmark datasets, MSCOCO and Flickr30K, show that the proposed method effectively improves the accuracy of image-text retrieval, validating its effectiveness.

In addition, a cross-modal image-text retrieval system is designed and implemented based on the proposed method. The system provides the core functions of "search by text" and "search by image": it receives a query sample, invokes the model to perform cross-modal retrieval, and presents the results to users in a clear, visual form. Testing verifies that the system effectively meets the demand for cross-modal image-text retrieval and has practical application value.
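The GCN-based semantic enhancement of the fused region features can be illustrated with a minimal NumPy sketch of one graph-convolution layer. This is not the paper's implementation: the region affinity matrix `adj` is assumed given, and the learned weight matrix is replaced by the identity for illustration.

```python
import numpy as np

def gcn_enhance(features, adj):
    """One graph-convolution layer (Kipf & Welling style) as a sketch of
    region semantic enhancement.

    features: (R, d) fused region features (salient-region + attribute-label).
    adj:      (R, R) symmetric region affinity matrix (an assumption here).
    Returns enhanced (R, d) features; the learned weight W is identity
    for illustration, so the layer computes ReLU(Â X).
    """
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt   # symmetric normalization D^-1/2 Â D^-1/2
    return np.maximum(norm_adj @ features, 0.0)  # ReLU activation
```

With this normalization, each region's feature becomes a degree-weighted mix of its own feature and those of its related regions, which is the semantic-enhancement effect described above.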
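The stacked cross-attention similarity inference between word features and region features can be sketched as follows (text-to-image direction only). The inverse-temperature `lam` and the final average pooling are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def stacked_cross_attention(regions, words, lam=9.0):
    """Sketch of text-to-image stacked cross-attention similarity.

    regions: (R, d) image region features; words: (T, d) word features.
    Each word attends over all regions, an attended region context is
    formed per word, and the image-text similarity is the mean cosine
    between words and their attended contexts.
    """
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = np.maximum(w @ r.T, 0.0)                     # (T, R) word-region cosines, thresholded
    sim = sim / (np.linalg.norm(sim, axis=1, keepdims=True) + 1e-8)
    attn = np.exp(lam * sim)
    attn = attn / attn.sum(axis=1, keepdims=True)      # softmax over regions per word
    attended = attn @ r                                # (T, d) region context per word
    cos = np.sum(w * attended, axis=1) / (np.linalg.norm(attended, axis=1) + 1e-8)
    return float(cos.mean())                           # scalar image-text similarity
```

A matched image-text pair (words well aligned with some region) yields a similarity near 1, while unrelated pairs score lower, which is the signal used to rank retrieval results.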