
Research On Semantic-Attentive Deep Image Captioning Method

Posted on: 2020-07-17    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S W Wang    Full Text: PDF
GTID: 1488306548492424    Subject: Computer Science and Technology
Abstract/Summary:
Image captioning refers to using a computer to automatically describe the content of a given image with natural language sentences. This requires the computer first to understand the image content comprehensively and in detail; that is, image features are learned to effectively represent the objects in a given image, their attributes, and the relationships between different objects. The learned visual features are then translated into a natural language sentence with correct grammar and logic. Among these two steps, understanding image content belongs to the field of computer vision, while generating natural language descriptions is an important task in natural language processing. Thus, image captioning involves two fields: computer vision and natural language processing.

With the continuing boom of deep learning, deep neural networks have been widely applied in fields such as computer vision and natural language processing. Unsurprisingly, neural-network-based image captioning models have also begun to develop rapidly, and most of them are based on an encoder-decoder architecture. In this architecture, the encoder uses a convolutional neural network (CNN) to encode the image content as image features, while the decoder employs a recurrent neural network (RNN) to translate the encoded image features into a text description. Existing studies advance the encoder and the decoder to boost image captioning performance, driven by two key motivations: 1) whether the image content is correctly understood affects the subsequent decoding procedure, so enhancing the encoder is important for capturing comprehensive and rich image information;
2) the decoder is responsible for translating the learned image features into text descriptions, which raises the questions of how to effectively organize image features so as to generate correct text descriptions close to human language, and how to reduce the semantic gap that arises during image-to-language conversion. To this end, this thesis introduces strategies such as high-level semantic concepts and attention mechanisms, which benefit the understanding of image content, to advance the encoder or the decoder. The goal is to overcome the issues of insufficient semantic information and attention defocus in the model, yielding better captioning performance. The main content and innovations of this thesis are four-fold:

1. Object-aware semantic attention for image captioning. Existing image captioning models have verified that captioning performance can be promoted by extracting image context information and spatial information. However, they all ignore the fine-grained object information in an image, which is vital for understanding and describing the image. To better understand image content and generate accurate image descriptions, this thesis studies how to explicitly mine the fine-grained information of the objects and their associations. Using a pretrained object detector, we construct three types of object-aware semantics: object category, relative size of objects, and their relative distance. In detail, 1) to describe the counts of the objects in an image, we construct a category matrix that stores the number of objects in each category, which empirically helps ensure that the generated descriptions are correct; 2) the relative size of objects implies their roles in an image, thereby avoiding theme deviation;
3) intuitively, the relative distance among objects to some degree reflects their associations. Experiments on benchmarks show that the proposed semantics are able to guide the attention module to predict the corresponding words. Compared with existing semantics-based captioning models, the proposed semantics greatly improve the accuracy of the predicted descriptions.

2. Cascade semantic fusion for image captioning. To make the encoder understand image content effectively, previous works focus on how to extract or construct different levels of image features, but they achieve satisfactory performance only at the cost of extensive manual experiments and feature-ensemble schemes. This thesis instead adopts cascaded deep networks to learn semantics in a self-learning way while fusing different levels of information, thus uncovering fine-grained, global image context and spatial information. The learned information then guides the decoder to produce accurate image descriptions. In addition, a semantic-aware attention module is introduced to reduce the negative effect of distractors and effectively improve the representation ability of the learned features. Ablation studies show that different levels of visual features have global and local effects on the text descriptions. Experiments on benchmarks demonstrate that the cascade-architecture-based image captioning model achieves significant performance gains.

3. GateCap: a gated spatial and semantic attention model for image captioning. In the encoder-decoder framework, most existing image captioning work considers how to use the encoder to capture semantic information in the image, but few studies focus on the potential capacity of the decoder, especially how the decoder converts image features into natural text descriptions close to human language. To this end, this thesis proposes a triple-LSTM decoder that highlights task-related spatial and semantic attention features in a divide-and-fuse learning manner. In the decoding process, the decoder employs a context gate module to predict the current word with high-level semantics in a balanced manner: reweighting the spatial and semantic attention features adaptively adjusts the salient regions so as to effectively align image regions with the correct words, further reducing the semantic gap between them. Experimental comparisons indicate that the context gate module can effectively alleviate the exposure bias problem present in most existing captioning models.

4. HeadCap: a hierarchical encoder-and-decoder image captioning model. Existing studies rarely advance both the encoder and the decoder simultaneously to improve captioning performance. A key question is how to couple the encoder with the decoder by adopting semantics to match image regions with words, thereby reducing the side effect of attention defocus on the model. To this end, this thesis proposes a simple and effective hierarchical encoder-and-decoder, which is the first to aggregate features from different convolutional layers into an image captioning model and allows image information to be collected progressively across different levels in a hierarchical fashion. Based on such hierarchical features, a multi-attention LSTM module is explored to conduct multi-level feature fusion; it applies a multi-attention mechanism for each word to reduce the risk of mismatch between visual features and words, thus strengthening the association between them.
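The gated fusion underlying a context gate module can be sketched as follows. This is a minimal NumPy illustration under assumed feature dimensions, not the thesis's actual GateCap implementation: a sigmoid gate conditioned on the decoder hidden state produces per-dimension weights in (0, 1) that balance the spatial and semantic attention contexts before the word-prediction step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_gate(spatial_ctx, semantic_ctx, hidden, W_g, b_g):
    """Gated fusion: a sigmoid gate conditioned on the decoder state
    reweights the spatial and semantic attention contexts."""
    g = sigmoid(hidden @ W_g + b_g)            # per-dimension gate in (0, 1)
    return g * spatial_ctx + (1 - g) * semantic_ctx

# Toy dimensions (assumed for illustration only).
rng = np.random.default_rng(1)
dim, hid = 8, 4
sp = rng.normal(size=dim)                      # spatial attention context
se = rng.normal(size=dim)                      # semantic attention context
h = rng.normal(size=hid)                       # decoder hidden state
W_g = rng.normal(size=(hid, dim))
b_g = np.zeros(dim)

fused = context_gate(sp, se, h, W_g, b_g)      # fed to the word predictor
```

Because the gate output lies strictly in (0, 1), the fused vector is an element-wise convex combination of the two attention contexts, so neither information source is discarded outright; the decoder state decides, dimension by dimension, how much each source contributes.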
Keywords/Search Tags:image captioning, encoder-decoder, high-level semantics, object detection, attention mechanism, context gate module, attention defocus