Image Captioning Based On Self-Attention Network

Posted on: 2022-01-08 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Z Li | Full Text: PDF
GTID: 2558306914962529 | Subject: Computer technology
Abstract/Summary:
The image captioning task aims to generate descriptive natural language sentences for a given image. It connects the fields of computer vision and natural language processing and is a multimodal task, so progress on it is crucial to bridging the semantic gap between images and text. In recent years, with the development of deep learning, models that use a Convolutional Neural Network as the encoder and a Recurrent Neural Network as the decoder have been widely used for image captioning. The inherently sequential structure of the Recurrent Neural Network causes a memory decay problem: at the current step the model attends to less of the preceding context than it did at the previous step. In addition, the decoder uses only the visual features of the image during decoding and ignores the spatial relationships between objects in the image. To address these problems, this thesis carries out the following work.

First, an image captioning model based on a self-attention network is adopted. An inter-modal attention module is designed and implemented to address the weak modal interaction capability of the traditional self-attention mechanism. An intra-modal attention module is designed and implemented to address the vanishing of the query vector as the self-attention mechanism propagates forward through the stacked network. Performance and ablation experiments on the MS-COCO and Flickr30K datasets verify the effectiveness of the proposed intra-modal and inter-modal attention modules through quantitative and qualitative analyses.

Second, an image captioning model that incorporates the spatial information of images is designed. Because the model lacks the spatial relationships between objects in the image during decoding, a spatial position coding module based on intersection over union is proposed to enrich the image feature information available to the model. Performance and ablation experiments verify the effectiveness of the proposed spatial position coding based on intersection over union through qualitative and quantitative analyses.

Finally, an image captioning system is designed and implemented on top of the above models. The system supports online model selection and returns generated descriptions for uploaded images. The functions of the system are tested, and the experimental results show that the system can generate accurate and detailed descriptions using the image captioning models.
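As a rough illustration of the quantity behind the spatial position coding, the sketch below computes the pairwise intersection over union (IoU) between object bounding boxes. The abstract does not say how the thesis turns IoU values into an encoding, so this covers only the standard IoU computation. The `(x1, y1, x2, y2)` box format, the function names `box_iou` and `pairwise_iou`, and the sample boxes are assumptions made for the example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2 (a hypothetical format chosen for this sketch).
    """
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pairwise_iou(boxes):
    """Symmetric N x N matrix of IoU values between all pairs of boxes."""
    return [[box_iou(a, b) for b in boxes] for a in boxes]


# Illustrative boxes: two that overlap and one that is disjoint from both.
boxes = [(0, 0, 2, 2), (1, 1, 3, 3), (4, 4, 5, 5)]
iou = pairwise_iou(boxes)
```

Each row of the matrix describes how strongly one detected region overlaps every other region. One plausible use, not confirmed by the abstract, is to feed such a matrix to the attention layers as a relational signal alongside the visual features.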
Keywords/Search Tags: deep learning, image captioning, self-attention network, spatial information encoding