
Image And Text Retrieval Method With Fine-grained Semantic Features

Posted on: 2023-07-12    Degree: Master    Type: Thesis
Country: China    Candidate: X Xiao    Full Text: PDF
GTID: 2558307115987989    Subject: Engineering
Abstract/Summary:
With the rapid development of the Internet, a large amount of multimedia content containing multi-modal information such as images and text has emerged online, and the need for cross-modal retrieval over this information has grown accordingly. This thesis builds generative image-text retrieval models on a traditional recurrent neural network and on the Transformer, adds a self-attention feature fusion encoder module to the Transformer, and proposes a Transformer-based fine-grained image-text retrieval method.

To address the weak correlation between images and texts in traditional direct retrieval methods, this thesis constructs three generative retrieval models to explore how fine-grained semantic features of images affect cross-modal image-text retrieval. First, in the retrieval model based on the recurrent neural network, a GRU with attention decodes the image, the image and the text are encoded to compute their similarity, and an image caption loss function is added to the optimization so that the generated captions are as close to the real captions as possible. Second, in the retrieval model based on the Transformer, a fine-grained image-text interaction model is introduced so that the similarity between the two modalities can be computed directly inside the model. Finally, in the fine-grained Transformer-based retrieval model, a pre-trained convolutional neural network extracts the global features of the image, a pre-trained Faster R-CNN locates the target regions of the image from which local features are extracted, and the designed self-attention feature fusion encoder module fuses the global features with the local features to strengthen the representation of fine-grained semantic information in the image.

In addition, this thesis sets up image-text retrieval and image captioning task scenarios, evaluates the three generative image-text retrieval models on the ICC dataset, and analyzes the image captioning results and visualizations of the image attention distribution. Experiments show that the proposed Transformer-based fine-grained image-text retrieval method improves the accuracy of both image-text retrieval and image captioning, and enhances the representation of image target regions and texts.
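To illustrate the fusion step described above, the following is a minimal sketch of a self-attention feature fusion encoder, assuming PyTorch; the class name, feature dimension, and layer counts are hypothetical and not taken from the thesis, which only specifies that global CNN features and Faster R-CNN region features are fused by self-attention.

    import torch
    import torch.nn as nn

    class FeatureFusionEncoder(nn.Module):
        """Fuses a global image feature with region features via self-attention (sketch)."""

        def __init__(self, dim=1024, num_heads=8, num_layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, global_feat, region_feats):
            # global_feat: (batch, dim) pooled feature of the whole image from a pre-trained CNN
            # region_feats: (batch, num_regions, dim) Faster R-CNN target-region features
            tokens = torch.cat([global_feat.unsqueeze(1), region_feats], dim=1)
            fused = self.encoder(tokens)  # self-attention lets global and local features attend to each other
            return fused                  # (batch, 1 + num_regions, dim) fused image representation

The fused tokens would then serve as the image-side input to the retrieval and captioning heads, which is consistent with, but more specific than, the description in the abstract.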
Keywords/Search Tags:Image-text retrieval, image caption, fine-grained features, Transformer