Research On Image Caption Based On Deep Learning

Posted on: 2020-11-19
Degree: Master
Type: Thesis
Country: China
Candidate: H B Wang
Full Text: PDF
GTID: 2518306050956969
Subject: Master of Engineering
Abstract/Summary:
Image captioning combines computer vision and natural language processing and is a research hotspot in artificial intelligence. Unlike image understanding tasks such as image classification and object detection, image captioning must not only recognize the objects in an image but also understand the relationships between them and express those relationships in natural language, which makes the task very challenging. Image captioning builds a bridge from vision to natural language and has broad application prospects in image retrieval, human-computer interaction, and intelligent monitoring. This thesis improves deep-learning-based image caption models in three respects, namely the attention mechanism, the decoder, and the training method, in order to improve the quality of the generated descriptions. The main research contents are as follows:

1. An image caption model based on attention fusion is proposed. Spatial attention obtained by directly dividing the image into regions may prevent the attention mechanism from accurately selecting the image features that correspond to a target. To solve this problem, this thesis uses Faster R-CNN as the encoder to detect the exact position of each target in the image, which improves the accuracy of spatial attention. The name attribute corresponding to each target is detected at the same time. The name attributes serve as high-level semantic attention and, together with spatial attention, guide the generation of the word sequence. Experimental results on the MSCOCO dataset show that the image caption model based on attention fusion outperforms the image caption model based on spatial attention alone and is superior to most mainstream image caption models, demonstrating the effectiveness of the attention fusion image caption model based on Faster R-CNN.

2. An attention fusion image caption model based on convolutional decoding is proposed. Recurrent neural networks cannot be parallelized, so the model trains too slowly, and they lose information when processing long sequences. To address these problems and improve training speed, this thesis uses a masked convolutional neural network combined with gated linear units as the decoder of the image caption model based on attention fusion. A convolutional neural network can process data in parallel and compute more efficiently, and its hierarchical structure is better able to capture and process complex relationships within sentences. Experimental results on the MSCOCO dataset show that the model with the convolutional decoder is more than 1.5 times faster than the model with the recurrent decoder while generating captions of similar quality, demonstrating that the convolutional decoder is effective and improves the training speed of the model.

3. Reinforcement learning is used to further optimize the image caption model based on attention fusion. Training the model with the cross-entropy loss suffers from exposure bias and from a mismatch between the training objective and the evaluation metrics. As a result, the generated captions can be inconsistent with the image content, and the evaluation metrics cannot be fully optimized during training. To mitigate these problems and further improve the image caption model based on attention fusion, this thesis trains the model with reinforcement learning, so that the model's inputs are consistent between training and testing and the evaluation score is used as the reward function. First, the model is trained with the cross-entropy loss until it reaches a stable state; then the REINFORCE algorithm is used to optimize the evaluation score and train the model further. Experimental results on the MSCOCO dataset show that the reinforcement learning method significantly improves the model's evaluation scores, indicating that training with reinforcement learning further improves the model's performance.
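
As an illustration of the decoder idea in contribution 2, the following is a minimal pure-Python sketch and not the thesis's implementation. It assumes that the masking is realized by left-padding a one-dimensional convolution, one common way to keep position t from seeing later words, and that the gated linear unit takes the form out = A * sigmoid(B). The function names, kernel shapes, and random weights are hypothetical and chosen only for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def causal_conv(seq, kernel, bias):
    """1-D convolution in which output t depends only on inputs t-K+1 .. t.

    seq:    T x D_in list of lists (one feature vector per word position)
    kernel: K x D_in x D_out nested list (assumed layout)
    bias:   list of length D_out
    """
    k = len(kernel)
    d_in = len(seq[0])
    d_out = len(bias)
    # Left-pad with K-1 zero vectors so that no future position is visible.
    padded = [[0.0] * d_in for _ in range(k - 1)] + seq
    out = []
    for t in range(len(seq)):
        window = padded[t:t + k]  # original positions t-K+1 .. t
        y = list(bias)
        for j in range(k):
            for i in range(d_in):
                for o in range(d_out):
                    y[o] += window[j][i] * kernel[j][i][o]
        out.append(y)
    return out


def gated_causal_layer(seq, kernel_a, bias_a, kernel_b, bias_b):
    """Gated linear unit over two causal convolutions: A * sigmoid(B)."""
    a = causal_conv(seq, kernel_a, bias_a)
    b = causal_conv(seq, kernel_b, bias_b)
    return [[ai * sigmoid(bi) for ai, bi in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def random_kernel(rng, k, d_in, d_out):
    """Hypothetical helper: random K x D_in x D_out weights for the demo."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(d_out)]
             for _ in range(d_in)]
            for _ in range(k)]


if __name__ == "__main__":
    rng = random.Random(0)
    words = [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(5)]
    layer = gated_causal_layer(
        words,
        random_kernel(rng, 3, 2, 4), [0.0] * 4,
        random_kernel(rng, 3, 2, 4), [0.0] * 4,
    )
    print(len(layer), "positions,", len(layer[0]), "channels each")
```

Because every output position is computed independently of the others, all positions of a training sentence can be processed in parallel, which is the property the abstract contrasts with recurrent decoders. Changing the last input vector leaves every earlier output unchanged, which is what the masking is meant to guarantee.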
Keywords/Search Tags:Deep learning, Image caption, Attention mechanism, Convolutional neural network