
Research On Image Caption Generation Model Based On Attention Mechanism

Posted on: 2022-10-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Li
GTID: 2518306326984729
Subject: Computer Science and Technology

Abstract/Summary:
Image captioning describes the content of an image with complete, accurate, and coherent natural language, drawing on both computer vision and natural language processing. The task requires a machine not only to recognize objects and scenes and to capture object attributes and the relationships between objects, but also to make good use of the extracted visual features so that a language model can generate accurate caption sentences. Although the task poses many challenges, it has wide applications in human-computer interaction, image retrieval, helping visually impaired people understand images, and other areas. This thesis builds on the attention mechanism, using attention to select image content, focus on key information, and exploit the spatial relationships between objects so that the generated descriptions match the image content more closely. The main work is summarized as follows:

(1) Because of interference from unimportant information, traditional image caption generation models fail to extract the key information in an image, so the generated descriptions lack good contextual information. To address this, an image caption generation model combining bottom-up and top-down attention is proposed: a bottom-up attention mechanism is used in the feature extraction stage and a top-down attention mechanism in the caption generation stage. Combined with a long short-term memory (LSTM) model, this improves the extraction of salient features from the image.

(2) To address the lack of spatial relation information in image caption generation models, an image captioning model based on an object-relation Transformer is proposed. The model works in two steps: Faster R-CNN first detects the appearance and bounding-box features of the image regions, and these features are then fed into an improved Transformer, which generates the caption through encoding and decoding. To strengthen the relationships between objects, the appearance and bounding-box features are combined into relation features that refine the attention weights of the encoder's self-attention layers (see the sketch following the abstract). To make full use of the image features, the connection between the encoder and the decoder is designed as a mesh structure.

(3) The two image caption generation models proposed in this thesis are evaluated on the public image captioning dataset MSCOCO and the self-built dataset UR-caption. The experimental results show that the multi-attention mechanism effectively extracts salient information from the image at the encoding stage and decodes it into captions consistent with the image content, and that the object-relation mesh Transformer model makes good use of the spatial relationships between objects and effectively improves the accuracy of the generated captions.
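The abstract does not include code, so the following PyTorch sketch is purely illustrative: it shows one common way to bias a Transformer encoder's self-attention with pairwise bounding-box geometry, in the spirit of the relation features described in contribution (2). All names (BoxRelationAttention, box_relation_embedding, d_model, d_geo) and the sinusoidal geometry embedding are assumptions, not the thesis implementation.

```python
# Illustrative sketch only; not the thesis code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def box_relation_embedding(boxes, dim=64, wave_len=1000.0):
    """Encode pairwise box geometry (dx, dy, dw, dh) with sinusoidal features.

    boxes: (N, 4) tensor of (x, y, w, h) for the detected regions.
    Returns a (N, N, dim) tensor of geometric relation features.
    """
    x, y, w, h = boxes.unbind(-1)
    # Pairwise log-ratios of box offsets and sizes (a common choice; assumption here).
    dx = torch.log(torch.abs(x[:, None] - x[None, :]) / w[:, None] + 1e-3)
    dy = torch.log(torch.abs(y[:, None] - y[None, :]) / h[:, None] + 1e-3)
    dw = torch.log(w[:, None] / w[None, :])
    dh = torch.log(h[:, None] / h[None, :])
    pos = torch.stack([dx, dy, dw, dh], dim=-1)           # (N, N, 4)
    # Sinusoidal embedding of each of the four geometric terms.
    feat_range = torch.arange(dim // 8, device=boxes.device)
    dim_mat = wave_len ** (8.0 / dim * feat_range)        # (dim/8,)
    pos = pos.unsqueeze(-1) * 100.0 / dim_mat             # (N, N, 4, dim/8)
    emb = torch.cat([pos.sin(), pos.cos()], dim=-1)       # (N, N, 4, dim/4)
    return emb.flatten(-2)                                # (N, N, dim)

class BoxRelationAttention(nn.Module):
    """Single-head self-attention whose weights are biased by box geometry."""
    def __init__(self, d_model=512, d_geo=64):
        super().__init__()
        self.d_geo = d_geo
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.geo = nn.Linear(d_geo, 1)   # projects geometry to a scalar weight

    def forward(self, feats, boxes):
        # feats: (N, d_model) appearance features, boxes: (N, 4) bounding boxes.
        q, k, v = self.q(feats), self.k(feats), self.v(feats)
        scale = q.size(-1) ** 0.5
        appearance = q @ k.t() / scale                            # (N, N)
        rel = box_relation_embedding(boxes, dim=self.d_geo)       # (N, N, d_geo)
        geometry = F.relu(self.geo(rel)).squeeze(-1)              # (N, N)
        # Geometry acts as a multiplicative prior on the appearance attention.
        attn = F.softmax(torch.log(geometry.clamp(min=1e-6)) + appearance, dim=-1)
        return attn @ v

if __name__ == "__main__":
    feats = torch.randn(36, 512)            # e.g. 36 region features from a detector
    boxes = torch.rand(36, 4) + 0.1         # (x, y, w, h), widths/heights kept positive
    out = BoxRelationAttention()(feats, boxes)
    print(out.shape)                        # torch.Size([36, 512])
```

In this formulation the geometric term down-weights attention between regions whose relative positions and sizes make a relationship unlikely, which is one way the appearance and bounding-box features could be combined into relation features as the abstract describes.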
Keywords/Search Tags: image caption generation, deep learning, attention mechanism, long short-term memory network, Transformer