
Deep Learning For Image Captioning

Posted on: 2021-05-27  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y Z Mao  Full Text: PDF
GTID: 1368330605481265  Subject: Intelligent Science and Technology

Abstract/Summary:
In the past 20 years, natural language processing and computer vision have made great progress in their respective fields. However, the morphological differences between textual and visual data have led these two disciplines to develop relatively independently. In recent years, with the rapid development of the mobile Internet, the ever-increasing volume of textual and visual data has urgently called for cross-disciplinary research, gradually forming a new discipline: Cross-Media Intelligence (CMI). CMI has spawned many new research tasks and scenarios, among which image captioning is an important one. It aims to automatically generate a natural language sentence describing an image. Recently, with the in-depth application of deep learning to CMI, image captioning has made great progress. The core idea of existing methods is to learn an image-conditioned language model based on an encoder-decoder structure. However, the asymmetry of semantic information between visual and textual data remains a central issue. Existing methods generally alleviate this problem in two ways: 1) grasp the main content of an image and describe it with a single sentence; 2) describe an image with multiple sentences to show more of its details. This dissertation proposes innovative solutions along both lines, including:

For single-sentence captioning, we propose an Image Gate Unit Long Short-Term Memory model. An important characteristic of the task is that a single sentence mostly describes only part of the image content, so single-sentence models usually need to select image content before captioning. Our model provides feature-oriented content selection: in a data-driven way, it automatically learns when to open or close the image gate unit, cross-filtering text and image features to achieve feature-level content selection. Moreover, to address the problem in the baseline method that the image feature attenuates as the time steps increase, we design a pulse feedforward mechanism that re-feeds the image feature to the model at a fixed frequency, ensuring that the image feature keeps supervising the generation of the later part of the sentence. Experimental results on three datasets show that our model significantly improves captioning performance, and comparison with various fusion methods demonstrates the effectiveness of the image gate unit in fusing image and text features. (A code sketch of both mechanisms follows below.)

For multi-sentence captioning, we propose a Topic-Oriented Multi-Sentence image captioning model. The key to multi-sentence captioning is finding a content-selection clue that organizes the generation of multiple sentences and enriches the description of an image. To this end, we are the first to propose using topics as the clue, describing an image with topic-oriented multiple sentences. The concept of a topic differs from that of a visual attribute: starting from the whole corpus, statistical machine learning discovers the different topics, or emphases, with which sentences describe images. By representing each topic as a topic vector, our model can generate sentences with specific topical properties. Experimental results show that our topic-driven captioning model makes full use of existing datasets (without additional annotation) to describe more image content, and comparison of several ways of fusing the topic vector shows that our model achieves better topical consistency. (A sketch of topic-conditioned decoding also follows below.)
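The following is a minimal, hypothetical PyTorch sketch of the two single-sentence mechanisms described above. The class name, the scalar-per-dimension gate, and the `pulse_period` parameter are our illustrative assumptions, not the dissertation's actual identifiers or design.

```python
# Sketch of an image-gated LSTM decoder with pulse feedforward (assumed form).
import torch
import torch.nn as nn

class ImageGateLSTM(nn.Module):
    def __init__(self, embed_dim, img_dim, hidden_dim, vocab_size, pulse_period=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(img_dim, embed_dim)
        # Image gate: decides, from the current word and hidden state,
        # how much visual information to let through at this step.
        self.gate = nn.Linear(embed_dim + hidden_dim, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.pulse_period = pulse_period  # re-feed image feature every k steps

    def forward(self, img_feat, captions):
        B, T = captions.shape
        v = self.img_proj(img_feat)                      # (B, embed_dim)
        h = img_feat.new_zeros(B, self.lstm.hidden_size)
        c = img_feat.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            x = self.embed(captions[:, t])               # current word feature
            g = torch.sigmoid(self.gate(torch.cat([x, h], dim=1)))
            # Cross-filter text and image features through the learned gate.
            fused = g * v + (1.0 - g) * x
            # Pulse feedforward: periodically re-inject the raw image
            # feature so it does not attenuate over long sequences.
            if t % self.pulse_period == 0:
                fused = fused + v
            h, c = self.lstm(fused, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                # (B, T, vocab)
```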
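And a sketch of how a topic vector might condition the decoder in the topic-oriented model. Fusing the topic vector by concatenation with each word embedding is one of several possible fusion strategies (the abstract says several were compared); it is shown here only for illustration, and all identifiers are hypothetical.

```python
# Sketch of topic-conditioned sentence decoding (assumed fusion by concatenation).
import torch
import torch.nn as nn

class TopicLSTMDecoder(nn.Module):
    def __init__(self, embed_dim, topic_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The topic vector is appended to every word embedding, so each
        # generated sentence is steered toward one topic/emphasis.
        self.lstm = nn.LSTM(embed_dim + topic_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, captions, topic_vec):
        x = self.embed(captions)                         # (B, T, embed_dim)
        topic = topic_vec.unsqueeze(1).expand(-1, x.size(1), -1)
        h, _ = self.lstm(torch.cat([x, topic], dim=2))
        return self.out(h)                               # (B, T, vocab)

# Multi-sentence output: decode once per topic vector, e.g.
#   for k in range(num_topics):
#       sentence_k = decode(decoder, image_state, topic_vectors[k])
```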
To address the limitations of the above model, we propose a Temporal Topic Attention driven Multi-Sentence (TAMS) image captioning model. Although the previous model provides an effective way of multi-sentence captioning, both theoretical and experimental analysis show that it can be improved in two main aspects. 1) The bag-of-words topic model cannot learn temporal information from a sentence. TAMS instead uses a temporal neural network, which retains the sentence structure information to a certain extent, and learns text topics with a Gaussian Mixture model over the continuous features produced by the temporal network. 2) A topic embedding built from topic words may involve words unrelated to a given image, which can lead to a deviant description. TAMS therefore introduces a topic-guided attention mechanism: through contrastive learning, the recombined image feature guided by a topic is pulled toward the sentence vector of the same topic and pushed away from the sentence vectors of other topics. In this way, the model transforms the different textual emphases of a topic into the relative importance of the block features of an image. When the topic-guided recombined image feature is used to generate descriptions, no irrelevant information is involved, which effectively avoids generating irrelevant sentences. Experiments show that the proposed model achieves better multi-sentence description performance and better topical consistency. (A sketch of the topic-guided attention appears after the next paragraph.)

Finally, we design and implement an automatic image captioning demonstration system with a three-layer structure, built on the proposed models. The system provides two image input methods, local upload and camera capture, and accepts two types of images, clothing and everyday-life scenes. It also offers two description modes, single-sentence and multi-sentence, for both image types. In addition, we design a web crawler to collect image-title pairs of clothing data, providing a data foundation for researchers interested in the automatic generation of clothing titles. (A minimal crawler sketch closes this summary.)
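Returning to TAMS: below is a minimal sketch of topic-guided attention trained with a contrastive objective. Using a margin-based triplet loss over cosine similarities, and learning topics offline with scikit-learn's GaussianMixture, are our illustrative assumptions; the dissertation's exact formulation may differ.

```python
# Sketch of TAMS-style topic-guided attention with a contrastive objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicGuidedAttention(nn.Module):
    def __init__(self, img_dim, topic_dim, attn_dim):
        super().__init__()
        self.w_img = nn.Linear(img_dim, attn_dim)
        self.w_topic = nn.Linear(topic_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, topic_vec):
        # regions: (B, R, img_dim) image block features; topic_vec: (B, topic_dim)
        e = torch.tanh(self.w_img(regions) + self.w_topic(topic_vec).unsqueeze(1))
        alpha = torch.softmax(self.score(e).squeeze(-1), dim=1)    # (B, R)
        # Recombine block features under the topic's relative importance.
        return (alpha.unsqueeze(-1) * regions).sum(dim=1)          # (B, img_dim)

def contrastive_loss(img_topic_feat, pos_sent, neg_sent, margin=0.2):
    # Sentence vectors are assumed projected to the image-feature dimension.
    # Pull the topic-recombined image feature toward the sentence vector of
    # the same topic; push it away from sentence vectors of other topics.
    pos = F.cosine_similarity(img_topic_feat, pos_sent)
    neg = F.cosine_similarity(img_topic_feat, neg_sent)
    return F.relu(margin - pos + neg).mean()

# Topics could be learned offline from temporal-network sentence features, e.g.:
#   from sklearn.mixture import GaussianMixture
#   gmm = GaussianMixture(n_components=K).fit(sentence_features)
#   topic_ids = gmm.predict(sentence_features)
```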
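And, for the demonstration system's data collection, a minimal sketch of an image-title crawler. The abstract does not name the crawled site, so the URL and CSS selectors below are placeholders only.

```python
# Sketch of an image-title pair crawler (placeholder URL and selectors).
import requests
from bs4 import BeautifulSoup

def crawl_pairs(list_url):
    html = requests.get(list_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for item in soup.select(".product-item"):      # placeholder selector
        img = item.find("img")
        title = item.find(class_="title")          # placeholder class name
        if img and img.get("src") and title:
            pairs.append((img["src"], title.get_text(strip=True)))
    return pairs

if __name__ == "__main__":
    for src, title in crawl_pairs("https://example.com/clothing"):
        print(title, src)
```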
Keywords/Search Tags: image captioning, deep learning, topic model, multi-modal language model