
Research On Image Caption Algorithm Based On Fusion Of Multi-attention Mechanism

Posted on: 2022-10-13    Degree: Master    Type: Thesis
Country: China    Candidate: L Guo    Full Text: PDF
GTID: 2518306734957169    Subject: Master of Engineering
Abstract/Summary:
Image captioning spans two major research fields, computer vision and natural language processing, and has long been a comparatively complex research task. With the rapid progress of deep learning, many researchers have turned to image captioning, and it has become a central research topic. Given an image, an algorithm automatically generates a natural-language sentence describing its content. The task has strong practical value in image-text retrieval and image-aided understanding. In recent years, how to use image features efficiently to generate more accurate semantic descriptions has become the main research direction for image captioning. Building on current image captioning methods, this thesis introduces spatial attention, channel attention and self-attention in the encoder, employs adaptive attention in the decoder, and studies improvements to the encoder and the decoder respectively.

First, this thesis incorporates spatial attention and channel attention into the existing encoder and proposes an image captioning model that fuses the two. Channel attention determines which target object in the image the currently generated word should refer to; different channels have different activation regions, so only some channels are activated when predicting a word. Spatial attention determines the positional information of the image target. This thesis adopts a convolutional spatial attention mechanism that preserves the spatial structure of the image and, with a larger receptive field, accurately determines the region to attend to at each step, so that the model focuses on primary information and ignores secondary information. This part fuses spatial attention and channel attention, applying attention along two different dimensions to produce attention-weighted image features and strengthening the encoder's feature-extraction ability. Experiments on the MSCOCO dataset show that the model fusing spatial attention and channel attention achieves clear improvements on the BLEU, METEOR, ROUGE and CIDEr metrics.

Second, this thesis introduces positional self-attention and channel self-attention into the encoder and proposes an image captioning model that fuses the two. The image features used by image captioning models are extracted by classical CNNs, which make insufficient use of global information, so this part introduces a self-attention mechanism to adaptively integrate local features and global dependencies. The positional self-attention mechanism selectively aggregates the features at each location through a weighted sum over all positions, while the channel self-attention mechanism selectively emphasizes feature maps through inter-channel dependencies. The model fuses positional self-attention and channel self-attention to produce self-attention-weighted image features and improve its expressive ability. Experiments on the MSCOCO dataset show that the image captioning model fusing positional self-attention and channel self-attention achieves clear improvements on BLEU, METEOR, ROUGE and CIDEr.
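As a rough illustration of this encoder-side fusion, the sketch below shows one common (DANet-style) way to implement a positional self-attention branch and a channel self-attention branch over CNN feature maps and fuse them by summation. It is an illustrative assumption rather than the thesis implementation; the module names, tensor shapes and summation-based fusion are placeholders.

    # Illustrative PyTorch sketch (not the thesis code): positional and channel
    # self-attention over CNN feature maps, fused by element-wise summation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PositionSelfAttention(nn.Module):
        """For every position, a weighted sum of the features at all positions."""
        def __init__(self, channels):
            super().__init__()
            self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
            self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
            self.value = nn.Conv2d(channels, channels, kernel_size=1)
            self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

        def forward(self, x):
            B, C, H, W = x.shape
            q = self.query(x).flatten(2).transpose(1, 2)           # (B, HW, C//8)
            k = self.key(x).flatten(2)                             # (B, C//8, HW)
            attn = F.softmax(q @ k, dim=-1)                        # (B, HW, HW)
            v = self.value(x).flatten(2)                           # (B, C, HW)
            out = (v @ attn.transpose(1, 2)).view(B, C, H, W)
            return self.gamma * out + x

    class ChannelSelfAttention(nn.Module):
        """Channel-to-channel similarity used to re-weight the feature maps."""
        def __init__(self):
            super().__init__()
            self.gamma = nn.Parameter(torch.zeros(1))

        def forward(self, x):
            B, C, H, W = x.shape
            flat = x.flatten(2)                                    # (B, C, HW)
            attn = F.softmax(flat @ flat.transpose(1, 2), dim=-1)  # (B, C, C)
            out = (attn @ flat).view(B, C, H, W)
            return self.gamma * out + x

    # Fuse the two branches by summation before passing the features to the decoder.
    feats = torch.randn(2, 512, 14, 14)
    fused = PositionSelfAttention(512)(feats) + ChannelSelfAttention()(feats)
    print(fused.shape)  # torch.Size([2, 512, 14, 14])

In this kind of design the residual weight gamma is initialized to zero, so each attention branch starts as an identity mapping and its influence is learned gradually during training.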
Finally, this thesis introduces an adaptive attention mechanism in the decoder. The decoder plays a vital role in the image captioning model, and most existing decoders use long short-term memory (LSTM) networks because of their capacity for long-term memory. To address their limited storage capacity, this part introduces an adaptive attention mechanism for decoding image features: adaptive attention lets the model rely more on textual information when generating non-visual words and more on image information when generating visual words, thereby improving the accuracy of the image captioning model. Comparative experiments on the MSCOCO dataset show that the adaptive-attention model achieves clear improvements on BLEU, METEOR, ROUGE and CIDEr. The thesis contains 32 figures, 7 tables and 70 references.
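As a rough illustration of this decoder-side idea, the sketch below shows a single decoding step with an adaptive-attention "visual sentinel": the model attends over the image regions plus a sentinel vector derived from the LSTM memory cell, so non-visual words can fall back on language information. This is an assumed, simplified sketch rather than the thesis implementation; all names and sizes, and the projection of region features into a shared attention space, are illustrative.

    # Illustrative PyTorch sketch (not the thesis code) of one adaptive-attention
    # decoding step with a visual sentinel as an extra attention candidate.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdaptiveAttentionStep(nn.Module):
        def __init__(self, embed_size, hidden_size, feat_size):
            super().__init__()
            self.lstm = nn.LSTMCell(embed_size + feat_size, hidden_size)
            self.proj_v = nn.Linear(feat_size, hidden_size)        # regions -> attention space
            self.gate = nn.Linear(embed_size + feat_size + hidden_size, hidden_size)
            self.att_h = nn.Linear(hidden_size, hidden_size)
            self.att_out = nn.Linear(hidden_size, 1)

        def forward(self, x_t, V, state):
            # x_t: (B, embed+feat) word embedding concatenated with a global image
            # feature; V: (B, K, feat_size) region features; state = (h, c).
            h_prev, _ = state
            h, c = self.lstm(x_t, state)

            # Visual sentinel: a gated copy of the memory cell that acts as a
            # "non-visual" candidate the model can attend to instead of the image.
            g = torch.sigmoid(self.gate(torch.cat([x_t, h_prev], dim=1)))
            s = g * torch.tanh(c)                                  # (B, hidden)

            # Attend jointly over the K image regions plus the sentinel.
            cand = torch.cat([self.proj_v(V), s.unsqueeze(1)], dim=1)  # (B, K+1, hidden)
            scores = self.att_out(torch.tanh(cand + self.att_h(h).unsqueeze(1))).squeeze(-1)
            alpha = F.softmax(scores, dim=1)     # last weight = reliance on the sentinel
            context = (alpha.unsqueeze(-1) * cand).sum(dim=1)      # (B, hidden)
            return context, (h, c)

    # One decoding step with 49 region features of size 2048 and a 512-d LSTM.
    step = AdaptiveAttentionStep(embed_size=256, hidden_size=512, feat_size=2048)
    x_t = torch.randn(4, 256 + 2048)
    V = torch.randn(4, 49, 2048)
    context, state = step(x_t, V, (torch.zeros(4, 512), torch.zeros(4, 512)))
    print(context.shape)  # torch.Size([4, 512])

In a full decoder, the context vector would then be combined with the hidden state to predict the next word.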
Keywords/Search Tags:deep learning, attention mechanism, image captioning, computer vision