The encoder-decoder framework based on deep learning has achieved great success in the field of image caption generation. Its essence is to use a convolutional neural network to mine and encode the information contained in an image, and then use a recurrent neural network to convert that encoding into a clear, logically structured text description. This approach not only exploits the information contained in the image in depth, but also generates descriptions with correct syntax and coherent logic, successfully improving the accuracy and practicality of image caption generation. However, most existing research mines only the semantic information inside the image and struggles to accurately capture the topological relationships among the categories and objects it contains. When people describe a picture, they usually consider the different categories present and the topological relationships among the main objects. This thesis therefore proposes an image caption generation method based on spatial topological relationships. The main research focus and contributions are summarized as follows:

(1) In the image encoding stage of caption generation, the convolutional neural network cannot extract the topological relationships between different categories. To address this missing information, this thesis introduces the topological relationships between the categories in an image to improve the quality of the generated caption. It first defines the topological relationship between different categories, processes the per-category heat maps with a convolutional neural network to obtain these relationships, and then encodes the topological relationships among all categories in a picture. The resulting encoding is fed, together with the image feature vectors, into the caption generation process, so that inter-category topological information is expressed in the generated caption and caption quality improves.

(2) Because the number of possible categories is large while each picture contains relatively few of them, this thesis also introduces a global topological relationship derived from the image itself to improve caption quality. It defines an encoding of the topological relationships between different objects in the image: the image is preprocessed to obtain the bounding boxes of the objects, and both the internal information of each object and the positional relationships between objects are encoded to produce the object-level relation encoding. An attention mechanism then assigns different weights to these relation encodings at each step of caption generation, so that this information is embedded accurately in the decoding stage and the generated caption is more complete.

(3) Finally, the thesis validates the two proposed methods on the MS COCO and Flickr30k datasets and compares the experimental results with those of existing algorithms. The results show that, compared with the traditional method, introducing topological relationships improves the generated descriptions on several evaluation metrics and strengthens the ability to describe the different categories in an image and the degree of relationship between objects. On pictures for which traditional algorithms generate failed descriptions, the descriptions generated for some images are far better than those of existing algorithms. Overall, the generated captions are more consistent with human language logic.
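The object-level relation encoding described in (2) can be sketched as follows. This is a minimal illustration, not the thesis's exact encoding: the 5-dimensional geometric feature (normalised centre offsets, log scale ratios, and IoU overlap) is a hypothetical choice commonly used for pairwise spatial relations between detected bounding boxes.

```python
import numpy as np

def box_relation(box_a, box_b):
    """Encode the spatial/topological relation between two boxes.

    Boxes are (x, y, w, h). The 5-dim feature (hypothetical choice)
    captures relative offset, relative scale, and overlap.
    """
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # Centre offsets, normalised by the reference box size.
    dx = (xb + wb / 2 - (xa + wa / 2)) / wa
    dy = (yb + hb / 2 - (ya + ha / 2)) / ha
    # Log relative scale.
    dw = np.log(wb / wa)
    dh = np.log(hb / ha)
    # Intersection-over-union as an overlap measure.
    ix = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))
    iy = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return np.array([dx, dy, dw, dh, inter / union])

def pairwise_relations(boxes):
    """Relation features for every ordered pair of detected objects."""
    n = len(boxes)
    return np.stack([box_relation(boxes[i], boxes[j])
                     for i in range(n) for j in range(n) if i != j])
```

In the full model these geometric features would be combined with each object's internal (appearance) features before being passed to the decoder.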
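The attention step in (2), which weights the relation encodings differently at each moment of caption generation, might look like the following additive-attention sketch. The projection parameters `W_r`, `W_h`, and `v` are hypothetical learned weights standing in for whatever parameterisation the thesis actually uses:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_relations(relations, hidden, W_r, W_h, v):
    """Soft attention over object-relation encodings at one decoding step.

    relations: (n, d_r) relation feature matrix for one image
    hidden:    (d_h,)   decoder hidden state at the current time step
    W_r, W_h, v: learned projections (hypothetical parameters)
    Returns the attention weights and the weighted context vector.
    """
    # Additive (Bahdanau-style) scores: v . tanh(W_r r_i + W_h h)
    scores = np.tanh(relations @ W_r + hidden @ W_h) @ v
    alpha = softmax(scores)
    # Context vector: attention-weighted sum of relation encodings,
    # which is then embedded into the caption generation step.
    context = alpha @ relations
    return alpha, context
```

Because the weights `alpha` are recomputed from the decoder state at every time step, different object relations can dominate at different points in the generated sentence.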