
Study On Image Captioning Based On Deep Learning

Posted on: 2021-01-15  Degree: Master  Type: Thesis
Country: China  Candidate: Z Zhang  Full Text: PDF
GTID: 2428330611450559  Subject: Computational Mathematics
Abstract/Summary:
When Alex Krizhevsky's deep convolutional neural network won the 2012 ImageNet competition, it rekindled a wave of artificial intelligence research. As one of the important domains of artificial intelligence, computer vision has likewise developed rapidly with the rise of deep learning models. Modern life is filled with images, most of which carry no accompanying textual description. Humans understand them easily, but for machines it is quite difficult to describe an image comprehensively. The image captioning task takes a picture as input and outputs a natural-language description of it; it combines computer vision with natural language processing. This is undoubtedly more challenging than traditional object detection and segmentation, because the algorithm must not only detect objects but also understand the relationships between them and then describe those relationships in natural language.

Several problems remain in image captioning: (1) convolutional neural networks, the dominant image feature extractors in computer vision, cannot capture the relationships between image objects or their hierarchical interactions; (2) recurrent neural networks and their extensions (LSTM, GRU, etc.) have become popular and effective frameworks for cross-domain sequence modeling, yet the descriptions they generate are too simple, and the generation process involves no reasoning; (3) too few image attributes are used, so the generated descriptions lack specificity.

The main contents and contributions are summarized as follows:
(1) We propose image captioning based on a graph convolutional network (GCN), which models the hierarchical interaction between different levels of abstract visual information in the image and its bounding boxes. The encoder uses a GCN to extract image feature information, which the decoder then turns into a caption; this model achieves good results in experiments.
(2) Beam search is an approximate inference algorithm widely used for decoding sequences from unidirectional neural network models. Because generated captions are often too simple and fail to highlight the focus of the image, we fuse beam search with an attention mechanism to generate the caption. Experiments show that this gives the captioning process a degree of reasoning.
(3) Traditional image captioning produces descriptions that are incomplete, generic, and not specific to the image content. We apply the idea of Generative Adversarial Networks to caption generation, which makes the generated captions more flexible; experiments confirm the effectiveness of this method.
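To illustrate contribution (1), the GCN encoder's core operation is neighbourhood aggregation over a graph of detected image regions. The following is a minimal NumPy sketch of a single mean-normalised graph-convolution layer; the adjacency matrix, region features, and weight matrix here are invented stand-ins, not the thesis's actual model or data.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution layer: aggregate neighbour features, then
    apply a learned linear transform and ReLU. `adj` is the adjacency
    matrix over detected regions, `feats` holds one feature vector per
    region (e.g. from a CNN detector), `weight` is the layer's parameters."""
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    deg = a_hat.sum(axis=1, keepdims=True)       # node degrees
    a_norm = a_hat / deg                         # row-normalise: mean aggregation
    return np.maximum(a_norm @ feats @ weight, 0.0)  # ReLU activation

# Three hypothetical image regions; an edge marks regions whose boxes interact.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)
feats = rng.standard_normal((3, 4))   # 4-d region features (stand-in for CNN output)
weight = rng.standard_normal((4, 2))  # learnable projection to 2 dimensions
out = gcn_layer(adj, feats, weight)
print(out.shape)  # one relation-aware feature vector per region
```

Stacking such layers lets each region's representation absorb information from its neighbours, which is how a GCN encoder can expose object-object relationships to the caption decoder.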
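For contribution (2), plain beam search (before any attention fusion) can be sketched as follows. The toy vocabulary, transition scores, and `step_fn` interface are invented for illustration; in the thesis the per-step scores would come from the decoder network.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=10):
    """Generic beam search over token sequences.

    step_fn(sequence) returns a list of (token, log_prob) candidates for
    the next position. Sequences ending in end_token are carried over."""
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                candidates.append((seq, score))  # finished beam, keep as-is
                continue
            for tok, logp in step_fn(seq):
                candidates.append((seq + [tok], score + logp))
        # keep only the beam_width highest-scoring partial sequences
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams[0][0]

# Toy "language model": transition log-probabilities from the last token.
VOCAB = {
    "<s>":  [("a", math.log(0.6)), ("the", math.log(0.4))],
    "a":    [("cat", math.log(0.7)), ("dog", math.log(0.3))],
    "the":  [("cat", math.log(0.5)), ("dog", math.log(0.5))],
    "cat":  [("</s>", 0.0)],
    "dog":  [("</s>", 0.0)],
}

def toy_step(seq):
    return VOCAB[seq[-1]]

print(beam_search(toy_step, "<s>", "</s>"))  # → ['<s>', 'a', 'cat', '</s>']
```

Because the beam keeps several hypotheses alive instead of committing greedily, fusing an attention distribution into `step_fn`'s scores lets the search prefer continuations that focus on salient image regions.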
Keywords/Search Tags: Deep Learning, Image Captioning, Convolutional Neural Networks, Recurrent Neural Networks, Scene Understanding