2D image description generation is a popular research topic in the field of artificial intelligence. It refers to the use of methods and techniques such as machine learning and deep learning to generate a textual description of the visual content of an image, and it involves both computer vision and natural language processing. Research on key algorithms for visual-feature-based 2D image textual description generation aims to produce complete, grammatically correct textual descriptions through feature extraction, feature fusion, language generation, and related techniques, so as to accurately describe the visual content of images. 2D image description generation is of great significance for scene understanding. Human communication relies largely on natural language, and enabling computers to describe our visual world opens up a large number of possible applications, such as image retrieval and semantic visual search, assistance for visually impaired users in human-computer interaction, and intelligent surveillance in road monitoring.

In recent years, although research in this field has made great progress, several problems remain to be solved, which manifest in the following aspects:

(1) Existing research considers only image-level features or coarse-grained attribute features, resulting in the loss of important discriminative information. In attention mechanism-based 2D image description generation methods, the attention mechanism can fully play its role as a feature fusion algorithm only if sufficient visual features have been extracted; in attribute-based image description generation methods, coarse-grained attribute features are usually used, and the complementarity between object features and fine-grained attribute information is ignored.

(2) In existing research, the visual content of a given image is not considered to be
understood in a coarse-grained to fine-grained manner, resulting in a lack of description of the fine-grained content of the image. Humans usually understand visual content in a coarse-grained to fine-grained manner: for a given 2D image, a human first quickly scans the visual content to obtain a general understanding, and then, according to different purposes, searches specific sub-regions of the image for the required visual information. However, this coarse-grained to fine-grained understanding does not appear in existing artificial neural network-based 2D image description generation methods. On the one hand, most deep neural network-based methods mainly consider extracting visual information that expresses the subject matter of the image and generating descriptive sentences; on the other hand, the generated description usually covers only the coarse-grained visual content of the 2D image, so important fine-grained content is lost.

(3) In existing research, the generated sentence lacks a description of the appearance attributes of objects. In 2D image description generation, the contribution of the attention mechanism as a feature fusion algorithm is that it assigns different weights to the input features and thereby merges multiple sources of feature information; it can therefore only determine which input features are important. Moreover, the output of the attention mechanism is usually a single fixed-length feature vector, which causes some important attribute information to be weakened. In addition, existing attribute-based image description generation methods ignore the role of the middle-level attribute information carried by objects, such as gender, age, and color. Due to the above problems, research on key algorithms for visual-feature-based 2D image textual description generation remains a very challenging topic. In
response to the above problems, this paper carries out the following research work:

(1) The complementarity between object features and attribute features, and their fusion via the attention mechanism. For visual feature extraction, the global image feature, object features, and attribute tag information are extracted from the 2D image. For feature fusion, the attention mechanism is used as the feature fusion algorithm over the extracted object features and attribute features, and an attention mechanism-based attribute-object fusion algorithm is proposed. First, the extracted global image feature is input into the language model to obtain a general understanding of the visual content; then the attention mechanism fuses the object features and attribute tag features to obtain important discriminative information. An attention mechanism-based attribute fusion algorithm is proposed to verify the complementarity between object features and attribute features, and a mean-based attribute-object fusion algorithm is proposed to verify the effectiveness and robustness of the attention mechanism.

(2) Coarse-grained to fine-grained understanding of different visual information by the language model, and hierarchical generation of image descriptions. For visual feature extraction, visual features of different granularities are extracted, including coarse-grained global image features, sets of image sub-space feature maps, and fine-grained object features and attribute features. To enable the language model to simulate the way humans understand visual scenes, a sequential dual attention mechanism is proposed as the feature fusion algorithm for visual information of different granularities. First, the global image features are input into the language model to obtain a general understanding of the visual content; then the spatial attention mechanism fuses the extracted sub-space feature map sets; finally, based on the
general understanding, the object attention mechanism fuses the object features and attribute tag features to obtain an understanding of the details of the image.

(3) The modifying effect of middle-level attribute information on objects. To prevent the attention mechanism from weakening the middle-level attribute information of objects and to improve the accuracy of appearance descriptions in the generated sentence, a middle-level attribute-based language retouching algorithm for 2D image caption generation is proposed. In the visual feature extraction stage, the VGG16 convolutional neural network is used as the basic classifier and trained on different data sets to obtain multiple classifiers for extracting human and non-human object attributes; a Faster R-CNN model then extracts the object features and the corresponding bounding boxes, and the bounding boxes are used to extract the middle-level attribute labels. During image description generation, the extracted middle-level attribute labels and the corresponding object labels are recombined into phrases that describe the appearance characteristics of the objects. Finally, the transitional description sentence generated by the language model is retouched by retrieval and substitution, which effectively improves the accuracy of the final sentence.

Verification on public data sets with multiple evaluation metrics leads to the following conclusions:

(1) The proposed attention mechanism-based attribute-object fusion algorithm verifies the complementarity between object features and attribute features, as well as the effectiveness and robustness of the attention mechanism as a feature fusion algorithm.

(2) The proposed sequential dual attention mechanism-based hierarchical 2D image textual description generation algorithm
makes full use of visual features of different granularities, which effectively avoids the loss of fine-grained visual content.

(3) The proposed middle-level attribute-based language retouching algorithm for image caption generation realizes the modifying effect of middle-level attribute information on individual objects, and avoids the weakening of object appearance attributes by the attention mechanism.
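As a concrete illustration of the attention-based feature fusion that the algorithms above build on, the following minimal NumPy sketch implements plain soft attention: each candidate feature vector (an object or attribute feature) is scored jointly with the language model's hidden state, the scores are normalized with a softmax, and the fused output is the weighted sum. The single-layer scorer, dimensions, and variable names here are simplified assumptions for illustration, not the thesis's actual model, which learns these parameters inside a full encoder-decoder.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fuse(hidden, features, W, b=0.0):
    """Soft-attention feature fusion (hypothetical single-layer scorer).

    hidden:   (d_h,)   current hidden state of the language model
    features: (k, d_f) k candidate feature vectors, e.g. object or
                       attribute features extracted from the image
    W:        (d_f + d_h,) weights of the scoring layer
    Returns the fused context vector (d_f,) and the attention weights (k,).
    """
    # Score each feature vector jointly with the hidden state.
    scores = np.array([W @ np.concatenate([f, hidden]) + b for f in features])
    weights = softmax(scores)        # non-negative, sums to 1
    context = weights @ features     # weighted sum = fused feature vector
    return context, weights

# Toy demo: fuse 3 object features (d_f = 4) under a hidden state (d_h = 2).
rng = np.random.default_rng(0)
obj_feats = rng.standard_normal((3, 4))
hidden = rng.standard_normal(2)
W = rng.standard_normal(6)
context, weights = attention_fuse(hidden, obj_feats, W)
print("attention weights:", weights)  # a convex combination over the inputs
```

Calling such a fusion step twice in sequence, first over spatial feature maps and then over object and attribute features, mirrors the idea of the sequential dual attention mechanism; it also makes visible the limitation noted in problem (3): the output is a single fixed-length vector, so weakly weighted attribute information is diluted.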