
Research On Image Captioning Based On Neural Network

Posted on: 2021-06-20
Degree: Master
Type: Thesis
Country: China
Candidate: M L Zhu
Full Text: PDF
GTID: 2518306476953369
Subject: Computer technology
Abstract/Summary:
Image captioning is exactly what its name implies: given an image, the computer automatically generates text that describes the content of the image. This task is easy for humans but very challenging for machines, as it requires combining computer vision and natural language processing to convert image content into descriptive text. Image captioning has a wide range of application scenarios and huge application prospects, in fields such as human-computer interaction, image indexing, intelligent monitoring, video annotation, and visual assistance.

In recent years, the Encoder-Decoder framework based on deep learning has made significant progress on the image captioning task. Recently, several studies have reported that caption models based on self-attention achieve state-of-the-art results. Compared with traditional recurrent neural network (RNN) based models, self-attention based models solve the time-dependency problem through the attention mechanism, so they can be trained efficiently in parallel and also achieve better performance in context modeling. However, self-attention requires computation quadratic in the sentence length.

This thesis studies and explores image captioning methods based on the Encoder-Decoder framework, combined with deep neural network technologies. The main work and contributions of this dissertation are as follows:

1. An image captioning model based on lightweight convolution and dynamic convolution was proposed. We apply lightweight convolution and dynamic convolution to the image captioning task as an alternative architecture to self-attention, decreasing the computational cost from O(N²) to O(N), where N is the sentence length.

2. A set of adaptive attention strategies was proposed to guide the model to extract image features from different positions at different time steps. The model can also decide whether to use visual information or the semantic information of the generated text to predict the current word. We further enhance the performance of the attention module by adding two-dimensional position information to the image features.

3. The proposed model was evaluated on the MSCOCO dataset, using a CNN-based model and a self-attention based model as baselines. The results show the effectiveness of our model: it achieves better performance than CNN-based models and is competitive with the state-of-the-art self-attention based model.
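The core of contribution 1 can be illustrated with a minimal NumPy sketch of the two convolution variants. This is an assumption-laden illustration, not the thesis's actual implementation: the head count, kernel size, padding scheme, and the function names `lightweight_conv` and `dynamic_conv` are all hypothetical, and production versions would be GPU kernels over batched tensors. It only shows why the cost is O(N·k) per position instead of O(N²): each output looks at a fixed-size window with a small, softmax-normalized kernel, which in the lightweight case is shared across all channels of a head and in the dynamic case is predicted from the current input.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lightweight_conv(x, w, num_heads):
    """Lightweight convolution sketch.

    x: (seq_len, d_model) input sequence
    w: (num_heads, kernel_size) kernel weights, shared by every
       channel within a head and softmax-normalized over the kernel.
    Cost per position is O(k), so O(N * k) overall vs O(N^2) for
    self-attention.
    """
    seq_len, d_model = x.shape
    k = w.shape[1]
    w = softmax(w, axis=-1)                    # normalize each head's kernel
    pad = k // 2
    xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)))
    out = np.zeros_like(x)
    ch_per_head = d_model // num_heads
    for h in range(num_heads):
        cols = slice(h * ch_per_head, (h + 1) * ch_per_head)
        for t in range(seq_len):
            # the same k weights are reused for every channel in this head
            out[t, cols] = w[h] @ xp[t:t + k, cols]
    return out

def dynamic_conv(x, proj):
    """Dynamic convolution sketch: the kernel is a function of x_t.

    proj: (d_model, kernel_size) linear map producing a
          position-specific kernel w_t = softmax(x_t @ proj).
    """
    seq_len, d_model = x.shape
    k = proj.shape[1]
    pad = k // 2
    xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(seq_len):
        w_t = softmax(x[t] @ proj)             # kernel depends on the timestep
        out[t] = w_t @ xp[t:t + k]
    return out
```

With all-zero kernel logits, the softmax yields a uniform kernel and `lightweight_conv` reduces to a local moving average, which makes the linear-in-N windowed behavior easy to verify by hand.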
Keywords/Search Tags:Image caption, Neural Network, Lightweight Convolution, Dynamic Convolution, Adaptive Attention