Image captioning refers to describing the content of an image in natural language, converting image information into text. The task draws on computer vision techniques as well as techniques from natural language processing. Image captioning models typically use an encoder-decoder framework: the encoder first converts the image into intermediate feature vectors, an attention mechanism assigns weights to the feature regions of the image, and the weighted image features are then passed to the decoder, which generates the description.

The generic attention mechanism fuses image feature vectors and text vectors with a single linear layer. Such single-layer linear fusion is of limited use and is insufficient for inferring deeper-level image feature information. In addition, the encoders that encode the scene graph in existing image captioning models all adopt relational graph convolutional networks, which consider only a node's own information and its relatively important neighbors, ignoring the potential relationships among the nodes of the whole scene graph. Moreover, although the structure of the scene graph itself carries image information, existing image captioning models use only the node information of the scene graph.

In this thesis, we propose the following improvements to the encoder-decoder model. First, to address the attention mechanism's limited ability to fuse text feature information with image feature information, we propose an enhanced attention mechanism. It combines local attention with global attention so that detailed information is considered from a global perspective, extracting higher-level feature vectors as the input of the long short-term memory (LSTM) network; this strengthens the interaction between text and image and improves both the quality and the evaluation scores of the generated description sentences. Second, to address the problem that the encoder in the image captioning model does not update the image feature vectors under the global structure, we propose a multi-mode encoder. During training, the encoder integrates and updates three sources of information for each scene-graph node: the node's own features, the features of its neighbor nodes, and the information of the whole image, which enhances the model's inference ability and improves the quality of the generated description sentences. Third, to address the problem that the generic attention mechanism cannot fully exploit the structural information of the scene graph, we propose a structural attention mechanism. It uses both the structural information and the node information of the scene graph to reason about image feature information at different levels, computes the similarity between text and image, and improves the quality of the generated description sentences.

We conduct experiments on the MSCOCO and Visual Genome datasets, evaluating the descriptions generated by the proposed model with the CIDEr, BLEU, METEOR, ROUGE, and SPICE metrics; the results demonstrate the effectiveness of the proposed model.
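The enhanced attention idea above can be illustrated with a minimal NumPy sketch. The function names, shapes, and the concatenation-based fusion below are assumptions made for illustration, not the thesis's exact formulation: local attention weights each image region by its similarity to the decoder state, global attention summarizes all regions, and the two contexts are combined into a higher-level feature vector that would feed the LSTM decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fused_attention(h, V):
    """Illustrative local+global attention fusion.

    h: decoder hidden state, shape (d,).
    V: image region features, shape (k, d).
    Returns a higher-level context vector of shape (2d,).
    """
    # Local attention: weight each region by similarity to the decoder state.
    local_w = softmax(V @ h / np.sqrt(h.size))   # (k,)
    local_ctx = local_w @ V                      # (d,)
    # Global attention: summarize the whole image (mean over regions here).
    global_ctx = V.mean(axis=0)                  # (d,)
    # Fuse detailed and global views into one feature vector.
    return np.concatenate([local_ctx, global_ctx])
```

In an actual captioning decoder, the fused vector would be concatenated with the word embedding at each time step as LSTM input; mean pooling stands in here for whatever global summary the model learns.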
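The multi-mode encoding described above can likewise be sketched as a single node-update step. This is a minimal NumPy illustration under assumed shapes and weight names (the thesis's exact parameterization is not reproduced here): each scene-graph node aggregates its own features, the mean of its neighbors' features, and a global summary of all nodes in the image.

```python
import numpy as np

def multi_mode_update(X, A, W_self, W_nbr, W_glob):
    """Illustrative multi-mode scene-graph node update.

    X: node features, shape (n, d).
    A: scene-graph adjacency matrix, shape (n, n).
    W_self, W_nbr, W_glob: (d, d) weight matrices for the three
    information sources (own node, neighbor nodes, whole image).
    """
    # Mean of each node's neighbors (degree-normalized; isolated nodes safe).
    deg = A.sum(axis=1, keepdims=True).clip(min=1.0)
    nbr = (A @ X) / deg
    # Global image context: the mean over all scene-graph nodes.
    glob = np.broadcast_to(X.mean(axis=0), X.shape)
    # Combine the three sources and apply a nonlinearity.
    return np.tanh(X @ W_self + nbr @ W_nbr + glob @ W_glob)
```

Stacking several such updates lets information from the whole scene graph, not just a node's immediate neighborhood, influence each node's encoding, which is the property the multi-mode encoder targets.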