
Research On Image Captioning Algorithm Based On Attention Mechanism

Posted on: 2022-06-12  Degree: Master  Type: Thesis
Country: China  Candidate: J Wang  Full Text: PDF
GTID: 2518306548961479  Subject: Software engineering
Abstract/Summary:
Image captioning is the task of generating a short natural-language sentence that describes the content of an input image scene. It bridges two different modalities, visual information and natural language, and therefore requires the combined use of computer vision and natural language processing techniques. The task can be regarded as a machine translation problem that "translates" image information into a short textual description. Traditional recurrent neural networks can process natural language effectively, but because each state of a recurrent network depends on the previous internal state, they suffer from vanishing or exploding gradients and encode long-range semantic dependencies poorly. Encoder-decoder models based on the Transformer architecture provide better performance than recurrent networks: through their internal self-attention and multi-head attention mechanisms they encode long-range semantic dependencies well, better capture the semantic associations between objects in an image, and can be trained in parallel. However, the Transformer architecture has some potential problems, mainly reflected in the following: 1) Transformer-based encoder-decoder models have a very large number of training parameters, and the multi-head attention computation fails to combine feature information across its multiple subspaces; 2) the key and value matrices in the Transformer encode only the specific content of a single image and fail to encode additional information about the image, such as the positional relationships between objects. To address these problems, we propose a series of improvements. The research content of this thesis comprises an image captioning algorithm based on a hybrid attention distribution and a factorized embedding matrix, and an image captioning algorithm based on a prior-knowledge Transformer. The main innovations are as follows:

(1) Image captioning algorithm based on a hybrid attention distribution and a factorized embedding matrix. To address the large number of training parameters in the Transformer, we observe that the word embedding matrix accounts for a large share of the parameters of the entire model, and we propose a factorized embedding matrix that effectively reduces these parameters. After factorization, the degree of correlation between words is also strengthened, and the reduced computational complexity of the model speeds up training convergence. Because the multi-head attention computation in the Transformer cannot consider the mutual relationships among its multiple subspaces, and to let the model better mine the correlations between vectors, we further propose a hybrid attention distribution, whose purpose is to make the model learn the latent associations among the subspaces during training and thereby improve its ability to represent the entire image or language input. The hybrid attention distribution first computes the Transformer's self-attention in multiple subspaces to obtain multiple attention weight distributions; the model then uses trainable parameters to compute a weighted sum of these distributions, allowing it to dynamically attend to the relationships among the subspaces. As a result, the model's ability to represent image information is improved, and performance is effectively improved compared with the original Transformer.

(2) Image captioning algorithm based on a prior-knowledge Transformer. The original Transformer computes multi-head attention from the query, key, and value matrices, which are obtained from the image features through projection matrices; they encode only image-specific information, and additional information is not considered by the model. We therefore propose a prior-knowledge Transformer, which introduces additional learnable vectors and recombines them with the key and value matrices into more expressive key and value matrices that participate in the multi-head attention computation. During training, the additional vectors learn latent information about the images in the training set, such as the positional relationships between objects and some common-sense knowledge. In the decoding stage, the model can provide the decoder with richer and more complete information to guide it toward generating more accurate words, thereby improving the overall performance of the model.

In summary, this thesis studies attention-based image captioning algorithms and proposes a series of improvements to the design defects of the Transformer structure. Experimental verification shows that the proposed algorithms outperform the baseline models, and the generated description sentences are more accurate and closer to the ground-truth descriptions.
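The factorized embedding described in contribution (1) can be illustrated with a minimal NumPy sketch. All dimensions (vocabulary size, hidden width, and the low embedding dimension) are assumed for illustration and are not taken from the thesis; the idea is simply that a V x H embedding table is replaced by a V x E table followed by an E x H projection, with E much smaller than H:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden_dim, embed_dim = 10000, 512, 128  # assumed sizes (E << H)

# Standard embedding table: V x H parameters.
full_params = vocab_size * hidden_dim

# Factorized embedding: a small V x E table plus an E x H projection.
E1 = rng.normal(size=(vocab_size, embed_dim))   # token -> low-dimensional embedding
E2 = rng.normal(size=(embed_dim, hidden_dim))   # projection up to the model width
factorized_params = vocab_size * embed_dim + embed_dim * hidden_dim

# Looking up tokens and projecting recovers hidden_dim-sized vectors.
token_ids = np.array([3, 41, 5926])
hidden = E1[token_ids] @ E2                     # shape (3, hidden_dim)

print(full_params, factorized_params, hidden.shape)
```

With these assumed sizes the parameter count drops from 5,120,000 to 1,345,536 while the model still receives hidden_dim-sized token representations.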
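The hybrid attention distribution of contribution (1) can likewise be sketched: standard multi-head attention computes one attention distribution per head (subspace), and the hybrid variant, as described above, mixes these distributions with trainable weights so that each head can exploit the others' subspaces. The mixing matrix and all tensor sizes below are hypothetical stand-ins for the trainable parameters the thesis describes:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
heads, seq, d = 4, 6, 8                              # assumed sizes

Q = rng.normal(size=(heads, seq, d))
K = rng.normal(size=(heads, seq, d))
V = rng.normal(size=(heads, seq, d))

# Per-head attention distributions, as in standard multi-head attention.
A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d))   # (heads, seq, seq)

# Hybrid distribution: each head's weights become a learned convex
# combination of all heads' distributions, letting the model attend to
# relationships across subspaces (mix would be trainable in the model).
mix = softmax(rng.normal(size=(heads, heads)), axis=-1)
A_hybrid = np.einsum('hg,gij->hij', mix, A)

out = A_hybrid @ V                                   # (heads, seq, d)
print(out.shape)
```

Because the mixing weights are a convex combination, each hybrid attention row still sums to one and remains a valid attention distribution.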
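The prior-knowledge Transformer of contribution (2) can be sketched in the same spirit: a set of trainable vectors, independent of any single image, is concatenated with the image-specific key and value matrices before attention is computed. The slot count and sizes below are assumptions for illustration, not values from the thesis:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq, d, n_mem = 6, 8, 4        # assumed sizes; n_mem = number of prior-knowledge slots

Q = rng.normal(size=(seq, d))  # image-specific query/key/value (single head for brevity)
K = rng.normal(size=(seq, d))
V = rng.normal(size=(seq, d))

# Trainable prior-knowledge vectors; during training they can absorb
# dataset-level regularities such as typical object relationships.
M_k = rng.normal(size=(n_mem, d))
M_v = rng.normal(size=(n_mem, d))

# Recombine: augment the image-specific keys/values with the prior slots.
K_aug = np.concatenate([K, M_k], axis=0)     # (seq + n_mem, d)
V_aug = np.concatenate([V, M_v], axis=0)

A = softmax(Q @ K_aug.T / np.sqrt(d))        # attends over image + prior slots
out = A @ V_aug                              # (seq, d)
print(out.shape)
```

The attention output keeps the original sequence shape; the queries simply have extra learned entries to attend over, which is how the decoder can draw on information beyond the current image.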
Keywords/Search Tags: Image Captioning, Transformer, Attention Mechanism, Encoder, Decoder