Research On Feature Fusion Strategies Of Attention Mechanism In Image Description

Posted on: 2021-12-30 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: T Jiang | Full Text: PDF
GTID: 1488306503982209 | Subject: Control Science and Engineering
Abstract/Summary:
The main task of image description is to automatically generate captions that express the content of images. Image description is an important branch of image understanding that connects computer vision and natural language processing, and it has promising applications such as navigation aids for blind people and image retrieval. The attention mechanism, a research hotspot in artificial intelligence, plays an important role in this field. It is inspired by the way the human visual system selectively focuses on important information while ignoring irrelevant information. The attention-based encoder-decoder framework is currently the mainstream method for image description. The attention mechanism fuses the features extracted by the encoding network according to the historical states of the decoding network, obtaining the visual semantic features for the current step, and the decoding network converts these visual features into text output. In this way the attention mechanism transforms static image features into temporal semantic features, which is essentially a process of feature fusion. The image features fused by the attention mechanism at each step therefore determine the accuracy of the captions generated by the decoder.

In image description, image data is non-sequential and spatially distributed, whereas captions are sequential and must be complete. However, existing attention mechanisms focus on the semantic information of the images when fusing image features. As a result, the following problems arise when generating image captions:
(1) Repeated recognition of image features leads to repeated descriptions, grammatical errors and ambiguity.
(2) Omission of image features leads to missing important information and an insufficient description of the image content.
(3) Inaccurate feature localization leads to incorrect matching of target attributes.
(4) Failure to consider the spatial distribution of images leads to incorrect recognition of the spatial positions and relationships of the targets.
We attribute these problems to the fact that the feature fusion of the attention mechanism does not take into account its own historical states or the spatial characteristics of images. To address them, this paper studies the following fusion strategies.

(1) To address the repeated recognition problem, an attention coverage fusion strategy with semantic guidance is proposed. When generating an image caption, the attention mechanism selectively fuses image features. Because it does not consider its own selection history, it often selects the same features repeatedly. The attention coverage fusion strategy introduces coverage vectors that preserve the attention history. The image features are weighted by the coverage vectors, so that features that have already received large attention weights are given less attention in subsequent fusion steps. Because image description is a sequential decision process and image features are redundant, not every feature needs to be selected by the attention mechanism, which makes computing the coverage vectors the main difficulty of this method. We use an LSTM network to model attention coverage, so that the model automatically learns how the coverage vectors cover the image features. In addition, the global semantic information of the image features provides contextual guidance for the attention coverage fusion strategy, so that it focuses more on features related to the image semantics.

(2) To address the omitted description problem, an attention feature fusion strategy with external memories is proposed. The external memories preserve information that is blocked by the decoder. Two issues have to be considered: information is written to the external memories at every step, so they must be continuously updated, and the information stored in them must be returned to the attention model. We add multiplicative forgetting gates and output gates to the external memories, which control the internal information update and the flow of information from the external memories to the attention model, respectively. With these multiplicative gates, the external memories automatically determine how information is updated and output. The external memories give the attention mechanism memory capacity, which lets it integrate more of the important features and reduces information loss.

(3) To address the inaccurate localization of target attributes, a reconstruction-based weakly supervised attention feature fusion strategy is proposed to supervise the feature fusion process of the attention mechanism. The attention mechanism predicts attention weights from the historical states of the decoding network to measure the importance of each feature. However, this feature selection lacks a supervision mechanism, so it is difficult to determine whether the attention weights are accurate. Because the training data contains no labelled correspondence between caption text and image features, supervision of the attention weights cannot be established directly. There is, however, a semantic correlation between the label text and the image features, so this paper adopts a weakly supervised learning strategy: the attention weights are reconstructed from the label text, and the error between the reconstructed and the original attention weights is minimized. Reconstructing the attention weights provides weak supervision that helps the attention mechanism locate image features more accurately.

(4) To address the inaccurate description of spatial relationships, a spatial relationship attention feature fusion strategy is proposed to learn the spatial characteristics of images. The attention mechanism computes the semantic features of an image at each step as a weighted combination, so the spatial characteristics are lost in the summation. The spatial relationship attention feature fusion strategy instead uses a fully convolutional neural network to extract features from the set of weighted image features, which helps to preserve their spatial distribution. Because spatial relationships are diverse, context information is needed to determine how the spatial relationship between different targets should be expressed. We therefore concatenate the decoder states with the weighted image features and then extract the semantic features with the fully convolutional neural network. In addition, the skip connection of the attention mechanism alleviates the vanishing gradient problem and helps the model training converge.

In this paper, the feature fusion strategies of the attention mechanism are improved for the image description application. The effectiveness of the algorithms is verified on three public datasets: Flickr8k, Flickr30k and MSCOCO. The key points of this paper are the feature fusion methods of the attention mechanism, which can be extended to other research frameworks based on attention mechanisms.
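The coverage idea in strategy (1) can be illustrated with a minimal NumPy sketch. It is not the thesis's model: the abstract states that coverage is modelled with an LSTM network and guided by global semantics, whereas this sketch swaps in a plain running sum of past attention weights as the coverage vector. The names `attend`, `penalty` and the dot-product scoring are assumptions made for illustration only.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(features, query, coverage, penalty=1.0):
    """One attention step with a simple coverage penalty (illustrative).

    features: (n_regions, d) image region features from the encoder
    query:    (d,) decoder state for the current step
    coverage: (n_regions,) sum of attention weights from earlier steps
    """
    # Regions that were already attended to get a lower score.
    scores = features @ query - penalty * coverage
    weights = softmax(scores)
    # Feature fusion: weighted combination of the region features.
    context = weights @ features
    return context, weights, coverage + weights

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))   # 5 regions, 4-dim features (toy data)
query = rng.normal(size=4)
coverage = np.zeros(5)
history = []
for step in range(3):
    context, weights, coverage = attend(features, query, coverage)
    history.append(weights)
```

With the query held fixed, the region that received the most attention at the first step receives less at the second step, which is the behaviour the coverage vectors are meant to encourage. A learned coverage model would replace the running sum.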
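The gated external memory in strategy (2) can be sketched in the same way. The abstract only states that multiplicative forgetting gates control the memory update and output gates control what flows back to the attention model; the exact equations are not given. The update rule below (a gate-weighted blend of old memory and new information), the function name `memory_step` and the weight matrices `W_f` and `W_o` are therefore assumptions, not the author's formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_step(memory, new_info, W_f, W_o):
    """One update of a gated external memory (illustrative).

    memory, new_info: (d,) vectors
    W_f, W_o:         (d, 2d) hypothetical gate weight matrices
    """
    z = np.concatenate([memory, new_info])
    forget = sigmoid(W_f @ z)   # how much of the old memory to keep
    output = sigmoid(W_o @ z)   # how much memory to expose to attention
    # Multiplicative update: blend old memory with the incoming information.
    memory = forget * memory + (1.0 - forget) * new_info
    # Gated read that would be passed back to the attention model.
    read = output * memory
    return memory, read

rng = np.random.default_rng(1)
d = 4
W_f = rng.normal(size=(d, 2 * d))
W_o = rng.normal(size=(d, 2 * d))
old_memory = rng.normal(size=d)
new_info = rng.normal(size=d)
memory, read = memory_step(old_memory, new_info, W_f, W_o)
```

Because both gates lie strictly between 0 and 1, each element of the updated memory stays between the old memory value and the new information, and the read vector never exceeds the memory in magnitude.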
Keywords/Search Tags:feature fusion strategy, coverage vector, external memories, reconstructing attention weights, spatial relationship learning, image description, attention mechanism, deep learning, encoder-decoder framework