
Image Captioning Algorithm Based On Graph Convolutional Networks And Attention Mechanism

Posted on: 2021-03-02  Degree: Master  Type: Thesis
Country: China  Candidate: Z Q Jiang  Full Text: PDF
GTID: 2428330647961944  Subject: Computer technology
Abstract/Summary:
Image captioning is an image-text fusion technique that aims to describe the content of an image in natural language. It has broad application prospects in fields such as image retrieval, robot question answering, and children's education. At present, the accuracy and vividness of the sentences generated by image captioning models still need improvement. This thesis studies the image captioning task based on graph convolutional networks and the attention mechanism. The main work is as follows:

First, D-ada, an image captioning model based on DenseNet and an adaptive attention mechanism, is proposed. Correctly extracting the global features of an image is difficult, and most attention-based methods force every word to attend to some image region, ignoring the fact that words such as "the" in the description text do not correspond to any image region. This thesis therefore proposes an adaptive attention model with a visual sentinel. In the encoding stage, a DenseNet is introduced to extract the global features of the image; at each time step, a sentinel gate set by the adaptive attention mechanism decides whether image feature information is used to generate the current word (a minimal sketch of this gate follows the abstract). In the decoding stage, a long short-term memory (LSTM) network serves as the language generation module of the captioning task. The Flickr30k and COCO datasets are used to test the performance of the adaptive attention model, and experiments show clear improvements on the BLEU and METEOR metrics.

Second, GCN-ada, an image captioning model based on a graph convolutional network (GCN), is proposed. Since visual relations are insufficiently exploited in the encoder-decoder captioning framework, we infer these relations to enrich the visual semantics, model them at both the semantic and spatial levels, and enhance the image encoder with the resulting visual connections to further improve the quality of the generated sentences. GCN-ada builds on D-ada: first, a set of salient image regions is extracted with the densely connected network, and a directed semantic graph is established over the detected regions; second, a graph convolutional network enriches the region representations with the visual relations encoded in the structured semantic and spatial graphs (one such layer is sketched after the abstract); finally, the learned relation-aware region representations are fed into an LSTM decoder with attention to generate sentences. Experiments on the Flickr30k and COCO captioning datasets verify that enriching region-level representations with visual relations improves the quality of the generated captions.

In summary, this thesis designs D-ada, an adaptive-attention model that improves the correspondence between the description text and the image and thereby the descriptive quality of the model, and then adds a graph convolutional network to D-ada to exploit the relationships among the described objects and enhance the detailed description of objects in the image.
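The sentinel mechanism in D-ada can be summarized as follows: at each decoding step the LSTM derives a "visual sentinel" from its memory cell, and the attention distributes its weight jointly over the image regions and the sentinel, so that non-visual words such as "the" can fall back on the language state alone. Below is a minimal PyTorch sketch of such a gate; the layer names, the shared dimension `dim`, and the joint-softmax formulation are illustrative assumptions, not the thesis's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    """Adaptive attention with a visual sentinel (illustrative sketch).

    Assumes region features and LSTM states share one dimension `dim`;
    the thesis does not specify layer sizes here.
    """

    def __init__(self, dim, input_dim):
        super().__init__()
        self.gate_x = nn.Linear(input_dim, dim)  # sentinel gate, from the LSTM input
        self.gate_h = nn.Linear(dim, dim)        # sentinel gate, from the previous hidden state
        self.key_v = nn.Linear(dim, dim)         # projects region features
        self.key_s = nn.Linear(dim, dim)         # projects the sentinel
        self.query = nn.Linear(dim, dim)         # projects the current hidden state
        self.score = nn.Linear(dim, 1)

    def forward(self, regions, x_t, h_prev, h_t, mem_t):
        # regions: (B, R, dim) image region features from the CNN encoder
        # x_t: (B, input_dim) LSTM input; h_prev, h_t, mem_t: (B, dim)
        gate = torch.sigmoid(self.gate_x(x_t) + self.gate_h(h_prev))
        sentinel = gate * torch.tanh(mem_t)            # visual sentinel s_t
        # Attend jointly over the R regions plus the sentinel.
        keys = torch.cat([self.key_v(regions),
                          self.key_s(sentinel).unsqueeze(1)], dim=1)   # (B, R+1, dim)
        scores = self.score(torch.tanh(keys + self.query(h_t).unsqueeze(1)))
        alpha = F.softmax(scores.squeeze(-1), dim=1)   # (B, R+1)
        values = torch.cat([regions, sentinel.unsqueeze(1)], dim=1)
        context = (alpha.unsqueeze(-1) * values).sum(dim=1)  # adaptive context
        beta = alpha[:, -1]                            # attention mass on the sentinel
        return context, beta
```

Here `beta` is the sentinel's share of the attention: values near 1 mean the next word is generated from the language state alone, while values near 0 mean the decoder is attending to the image.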
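The encoder-side enrichment in GCN-ada amounts to running graph convolutions over the detected regions so that each region representation absorbs information from its relational neighbors along directed edges. The sketch below shows one such layer; the relation predictor that produces the adjacency matrices, the direction-specific weights, and the mean aggregation are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGCN(nn.Module):
    """One graph-convolution layer over detected regions (illustrative sketch).

    The directed adjacency `adj` is assumed to come from a separate relation
    predictor (semantic edges) or from box geometry (spatial edges).
    """

    def __init__(self, dim):
        super().__init__()
        self.w_self = nn.Linear(dim, dim)  # transform of the node itself
        self.w_in = nn.Linear(dim, dim)    # messages along incoming edges
        self.w_out = nn.Linear(dim, dim)   # messages along outgoing edges

    def forward(self, nodes, adj):
        # nodes: (B, R, dim) region features; adj: (B, R, R) with
        # adj[b, i, j] = 1 if a directed edge i -> j was predicted.
        deg_out = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        deg_in = adj.transpose(1, 2).sum(dim=-1, keepdim=True).clamp(min=1)
        msg_out = (adj @ self.w_out(nodes)) / deg_out               # mean over successors
        msg_in = (adj.transpose(1, 2) @ self.w_in(nodes)) / deg_in  # mean over predecessors
        return F.relu(self.w_self(nodes) + msg_out + msg_in)


# Hypothetical usage: enrich region features on a relation graph, then feed
# the relation-aware representations to the attention LSTM decoder.
regions = torch.randn(2, 36, 512)                    # 36 detected regions per image (assumed)
sem_adj = torch.randint(0, 2, (2, 36, 36)).float()   # stand-in for predicted semantic edges
enriched = RelationGCN(512)(regions, sem_adj)        # (B, R, dim), decoder input
```

In the full model, one such GCN would operate on the semantic graph and another on the spatial graph, and the enriched region features would replace the plain CNN features as the input to the attention LSTM decoder.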
Keywords/Search Tags: image caption, DenseNet, adaptive attention mechanism, graph convolutional network, visual relationship