
Research on Image Captioning Based on Deep Learning

Posted on: 2020-02-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X X Zhu
GTID: 1368330572472282
Subject: Computer Science and Technology
Abstract/Summary:
Image captioning is a multimodal task: given an input image, the model generates text that accurately describes its content. The task draws on both computer vision and natural language processing. Converting image content into descriptive text helps establish the semantic relationships between objects in the image and deepens the understanding of image content. The deep-learning encoder-decoder framework provides a set of solutions for image captioning. However, existing methods still have several problems: 1) spatial information in the image is underused; 2) training with the cross-entropy loss function introduces an inconsistency between the training and testing stages; 3) captioning models based on recurrent neural networks have low parallelism at the training stage. To address these problems, we propose a series of improvements. The main research contents of this dissertation are image captioning based on a multi-attention mechanism and a parallel stacked recurrent neural network, image captioning based on Word-Gate and adaptive self-critical sequence training, and a captioning transformer based on a stacked attention mechanism. The main innovations are as follows:

(1) Image captioning based on a multi-attention mechanism and a parallel stacked recurrent neural network. To better understand image content and to let the generated text use local image information more effectively at each time step, we propose an image captioning model based on a multi-attention mechanism. The traditional attention mechanism fuses local image information only at the input stage of the long short-term memory (LSTM) network. Our model instead fuses local image features at all stages of the LSTM, so that local features are used more efficiently at the word generation stage. We also address the LSTM's weakness in modeling historical information by proposing an attention method that models the history of previously generated words together with the semantic information of the image itself. With these three attention mechanisms, the proposed method improves on traditional image captioning baseline methods. We further propose a parallel stacked LSTM to replace the original stacked LSTM and carry out a series of experiments. Compared with the traditional stacked LSTM, the performance of this model is effectively improved.

(2) Image captioning based on Word-Gate and adaptive self-critical sequence training. Image captioning is a word sequence generation task. Traditional training of recurrent neural networks relies on the cross-entropy loss function, which leads to an inconsistency between the training and testing stages. To solve this problem, we propose a reinforcement learning method based on a reward function that uses historical information. Compared with previous methods, the training baseline of this method is more stable. Unlike training with the cross-entropy loss function, this method trains the model through a reward function, and the reward function takes the similarity between words into account. In image captioning, the candidate words at each step span the whole dictionary, so the action space in reinforcement learning is high-dimensional. To make action selection easier, we propose a mechanism based on Word-Gate. With this mechanism, the model effectively shrinks the action space, so that generation amounts to choosing within a more precise range of words, which makes reinforcement learning easier to train. Comparative experiments show that the proposed method achieves better performance.

(3) Captioning transformer based on a stacked attention mechanism. In captioning models based on recurrent neural networks, words are generated one by one, each conditioned on the previously generated words. At the training stage, the model must wait for the earlier words to be generated before the current word can be trained, which creates a sequential dependence during training. To improve training, we propose an image captioning model based on a stacked attention mechanism. Unlike traditional captioning models based on recurrent neural networks, this model consists of multi-head attention and self-attention modules, which can be trained in parallel. Following the approach of deep convolutional neural networks, we stack more multi-head attention and self-attention modules and introduce a residual mechanism so that the deep network can be trained effectively. To train the multi-level model more effectively, we propose a multi-level supervised training method, which enables each level of the model to output semantic information. Finally, an average pooling layer fuses the outputs of all levels. Compared with the traditional approach of optimizing only the top level, the model achieves better performance.

In summary, this dissertation studies image captioning based on deep learning and proposes a series of improvements to address the problems of existing methods. Experiments show that the proposed image captioning algorithms effectively improve captioning performance over traditional methods, address shortcomings of previous algorithms, and generate more accurate image descriptions.
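
As a rough illustration of the reward-based training described in contribution (2), the sketch below shows the standard self-critical sequence training loss, in which the reward of a sampled caption is compared against the reward of a greedily decoded caption. It is not the adaptive variant proposed in the dissertation, whose details are not given in this abstract. The function names, the toy word-overlap reward (a stand-in for a captioning metric such as CIDEr), and the example captions and probabilities are all assumptions made for illustration.

```python
import math


def toy_reward(candidate, reference):
    """Hypothetical reward: fraction of candidate words found in the reference.

    A real system would typically use a captioning metric such as CIDEr here.
    """
    words = candidate.split()
    if not words:
        return 0.0
    reference_words = set(reference.split())
    return sum(w in reference_words for w in words) / len(words)


def scst_loss(sample_log_probs, sample_reward, baseline_reward):
    """Standard self-critical loss for one sampled caption.

    sample_log_probs: log-probability of each sampled word under the model.
    sample_reward:    reward of the sampled caption.
    baseline_reward:  reward of the greedily decoded caption (the baseline).
    """
    advantage = sample_reward - baseline_reward
    seq_log_prob = sum(sample_log_probs)
    # Minimizing this value raises the probability of sampled captions that
    # score above the greedy baseline and lowers it for those that score below.
    return -advantage * seq_log_prob


# Assumed example data, for illustration only.
reference = "a dog runs on the grass"
sampled = "a dog runs on grass"        # caption sampled from the model
greedy = "a cat sits on the grass"     # caption from greedy decoding
log_probs = [math.log(0.5)] * len(sampled.split())

loss = scst_loss(
    log_probs,
    toy_reward(sampled, reference),
    toy_reward(greedy, reference),
)
print(loss)
```

Because the reward is computed on whole generated captions, the model is optimized under the same conditions it faces at test time, which is the training and testing inconsistency that cross-entropy training leaves open.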
Keywords/Search Tags: image captioning, deep learning, image understanding, computer vision