| With the rapid development of artificial intelligence, image captioning has become an important research topic as a key technology for human-computer interaction. The task requires not only accurately identifying the content of an image but also understanding the relationships among the objects in it, which makes it very challenging. Encoder-decoder image captioning frameworks based on deep learning have greatly improved performance, but they still have the following deficiencies: (1) When a long short-term memory (LSTM) network decodes image features to generate a caption, the features are compressed into a one-dimensional space, which destroys the spatial structure of the original image. In addition, the traditional attention mechanism learns a weight matrix to locate the key regions of the image, which causes the network to focus too heavily on local information and ignore the global information of the image. (2) In the existing encoder-decoder framework, the decoder is a recurrent neural network or one of its variants, which is a unidirectional sequence model. When generating a description sentence, the model can only predict from the words already generated and cannot effectively attend to reverse context information. To address these problems, this thesis studies image captioning algorithms based on deep learning and proposes a series of improvements. The main research contents and contributions are: (1) A multi-attention image captioning algorithm based on a multi-dimensional hidden spatial structure is proposed. It preserves the original spatial structure of the image during decoding so that the attention mechanism can better capture the key information in the image. A cross-channel attention mechanism is then added to supplement the global information of the image. (2) An image captioning algorithm based on bidirectional context information is proposed. A backward decoder is added to obtain reverse context information, which is automatically fed to the forward decoder through an attention mechanism. Two adaptive gating mechanisms are then designed to reduce interference from image features and from noise in the reverse context information. In summary, this thesis studies image captioning algorithms based on deep learning and proposes a series of improvements to address the defects of the existing encoder-decoder framework. The proposed methods are verified on the public datasets MS-COCO and Flickr-30K and achieve good results. |
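
The bidirectional-context idea in contribution (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the dot-product attention, the scalar sigmoid gate, and every name and value below (`attend`, `gated_context`, `gate_w`, the toy vectors) are assumptions chosen only to show how a forward decoder state might attend over backward decoder states and then gate the resulting context.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Dot-product attention (an assumed form): weight each backward
    decoder state by its similarity to the forward decoder state."""
    weights = softmax([dot(query, s) for s in states])
    dim = len(states[0])
    context = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return context, weights


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_context(query, context, gate_weights, gate_bias):
    """Hypothetical adaptive gate: g = sigmoid(w . [query; context] + b).
    The context is scaled by g, so a small g suppresses noisy context."""
    g = sigmoid(dot(gate_weights, query + context) + gate_bias)
    return [g * c for c in context], g


# Toy values, purely illustrative.
forward_state = [0.5, -0.2]                              # forward decoder hidden state
backward_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # backward decoder hidden states
gate_w = [0.1, 0.1, 0.1, 0.1]                            # hypothetical learned gate weights

context, weights = attend(forward_state, backward_states)
filtered, g = gated_context(forward_state, context, gate_w, 0.0)
```

In a trained model the gate weights would be learned jointly with the decoders. A second gate of the same form could be applied to the image features, which is one plausible reading of the "two adaptive gating mechanisms" described above.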