
Research on Image Captioning Methods Based on Deep Learning

Posted on: 2020-07-04
Degree: Master
Type: Thesis
Country: China
Candidate: X X Liu
Full Text: PDF
GTID: 2428330572477742
Subject: Information and Communication Engineering
Abstract/Summary:
Image captioning is a hot topic in image understanding. It naturally decomposes into two parts that correspond to two of the most important fields of artificial intelligence: machine vision and natural language processing. With the development of deep neural networks and better-labeled datasets, image captioning techniques have advanced quickly. Currently, the most widely used approach is the end-to-end model that combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN) to generate captions for images. However, this method describes images incompletely: region-based dense captioning produces redundant descriptions that are independent and unrelated to one another, while a general single-sentence description still omits content. To overcome these problems, this thesis studies a joint model that combines high-level concepts (obtained by dense captioning of local regions) with image features through an attention mechanism, and proposes an improved algorithm that reasonably fuses local text suggestion boxes to construct a global text graph. Together, these allow an image to be described with one concise sentence or several sentences while capturing accurate and substantial image content.

First, this thesis presents the background and the theoretical and practical significance of image captioning in the field of artificial intelligence, and reviews the research status and open problems in this field at home and abroad.

Second, to address incomplete image descriptions, this thesis extracts the global feature of the image together with high-level semantics from local regions, and fuses the semantic information to guide caption generation. The improved model can therefore use global information to generate the overall description while attending to image details to enrich it, making the caption more comprehensive and combining top-down and bottom-up models. At the same time, an attention mechanism is introduced to simulate human visual attention and guide sentence generation. The attention mechanism assigns different confidence to the high-level concepts according to the word generated at the previous step, so the concepts reflecting local information are better integrated into text generation, making the description more comprehensive and accurate.

In addition, a local region text box fusion method is proposed on top of dense captioning. A global text graph is constructed by combining the local descriptions, and different objects are then integrated according to the intersection-over-union and positional relationships of the region boxes. This connects the regions to one another and makes it possible to merge multiple partial descriptions into one or more sentences as a whole by removing redundant dense descriptions.

Finally, the above model is built with the torch framework. A VGG CNN serves as the encoder to extract image features, and an LSTM RNN serves as the decoder to generate the description. The model was trained on the Visual Genome and Microsoft COCO datasets, and multiple evaluation metrics were used to test it on Microsoft COCO, Flickr30K, and randomly downloaded images. Experiments show that the proposed model generates more comprehensive captions whose language is clear, logical, and free of repetition.
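
As a point of reference for the encoder-decoder pipeline described above, the following is a minimal sketch of a VGG encoder feeding an LSTM decoder in PyTorch. The abstract says only "the torch framework"; the specific layers, dimensions, and API calls here are illustrative assumptions, not the thesis's actual code.

    # Minimal encoder-decoder sketch: VGG features in, LSTM-generated words out.
    # All names and dimensions are illustrative, not the thesis's implementation.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class Encoder(nn.Module):
        def __init__(self):
            super().__init__()
            vgg = models.vgg16(weights=None)      # torchvision >= 0.13 API
            self.features = vgg.features          # conv stack -> (B, 512, 7, 7)
            self.pool = nn.AdaptiveAvgPool2d(1)   # collapse to a global image feature

        def forward(self, images):
            f = self.pool(self.features(images))  # (B, 512, 1, 1)
            return f.flatten(1)                   # (B, 512)

    class Decoder(nn.Module):
        def __init__(self, vocab_size, feat_dim=512, embed_dim=256, hidden=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.init_h = nn.Linear(feat_dim, hidden)  # image feature seeds the LSTM state
            self.lstm = nn.LSTMCell(embed_dim, hidden)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, feat, captions):
            h = torch.tanh(self.init_h(feat))
            c = torch.zeros_like(h)
            logits = []
            for t in range(captions.size(1)):          # teacher forcing over the caption
                h, c = self.lstm(self.embed(captions[:, t]), (h, c))
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)          # (B, T, vocab)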
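The concept-attention step described above can be sketched as standard additive attention, with weights over the high-level concept vectors computed from the decoder state carried over from the previous word; the class name and dimensions below are illustrative assumptions.

    # Sketch of attention over high-level concept vectors: the decoder state
    # from the previous step decides how much confidence each concept receives.
    # This is a generic additive-attention sketch, not the thesis's exact formulation.
    import torch
    import torch.nn as nn

    class ConceptAttention(nn.Module):
        def __init__(self, concept_dim, hidden_dim, attn_dim=256):
            super().__init__()
            self.w_c = nn.Linear(concept_dim, attn_dim)
            self.w_h = nn.Linear(hidden_dim, attn_dim)
            self.v = nn.Linear(attn_dim, 1)

        def forward(self, concepts, h_prev):
            # concepts: (B, K, concept_dim) -- one vector per detected concept
            # h_prev:   (B, hidden_dim)     -- decoder state after the previous word
            scores = self.v(torch.tanh(self.w_c(concepts)
                                       + self.w_h(h_prev).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)        # confidence per concept
            context = (alpha * concepts).sum(dim=1)     # weighted concept summary
            return context, alpha.squeeze(-1)

The returned context vector would then be fed into the decoder at each step, so the word being generated steers which local concepts are emphasized.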
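The region box fusion could, for example, merge boxes by intersection-over-union as sketched below; the threshold, the greedy grouping rule, and the enclosing-box merge are illustrative assumptions standing in for the thesis's fusion rule based on intersection ratio and positional relationship.

    # Sketch of merging dense-caption region boxes by intersection-over-union:
    # boxes whose IoU exceeds a threshold are grouped and replaced by their
    # enclosing box, connecting overlapping regions and removing redundancy.
    def iou(a, b):
        # boxes given as (x1, y1, x2, y2)
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    def fuse_boxes(boxes, thresh=0.5):
        merged = []
        # process larger boxes first so small overlapping regions fold into them
        for box in sorted(boxes, key=lambda r: -((r[2] - r[0]) * (r[3] - r[1]))):
            for i, m in enumerate(merged):
                if iou(box, m) >= thresh:
                    # replace with the enclosing box so the regions connect
                    merged[i] = (min(m[0], box[0]), min(m[1], box[1]),
                                 max(m[2], box[2]), max(m[3], box[3]))
                    break
            else:
                merged.append(box)
        return merged

This greedy single-pass grouping is a simplification; captions attached to the fused boxes would then be combined into one or more sentences, as the abstract describes.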
Keywords/Search Tags: deep learning, image captioning, high-level concepts, attention mechanism, region box fusion