In the Internet era, intelligent devices generate massive amounts of data, such as text and images. Images are intuitive and easy to understand; compared with images, text offers a traditional and concise way to express and exchange information. Image captioning is an attractive technology that automatically generates natural-language sentences describing the content of an image. It has been widely used in human-computer dialogue, image-text matching, and other applications. Compared with object detection, image captioning not only names the objects in an image but also describes the attributes of those objects and the relationships between them more precisely. Therefore, how to mine visual semantic words and how to establish their correlations are major challenges in image captioning. Our contributions are summarized as follows:

(1) We propose a new semantic graph construction method. Three kinds of nodes describe the important attributes of, and relations among, the objects in an image, and three deep-learning-based detection models detect and recognize the corresponding visual features. Specifically, Faster R-CNN detects the objects in an image, and each object word is mapped to the feature embedding vector of an object node in the semantic graph. An attribute detector, a simple multi-layer perceptron followed by a softmax function, predicts the attributes of each object; an independent, trainable word embedding layer encodes each attribute word as the feature embedding vector of an attribute node. A Bi-LSTM serves as the relationship detection model, predicting relationships by combining object and visual region features; relationship words are encoded in the same way as attribute nodes. Edges between nodes are represented in matrix form: object nodes and relationship nodes form three-tuples, while object nodes and attribute nodes form two-tuples. Joining the object nodes across these n-tuples yields the semantic graph. Finally, a graph convolutional network enhances the node representations with the visual relationships in the semantic graph.

(2) We propose a semantic sentinel mechanism for image captioning. The sentinel mechanism helps the model choose between the visual scene and the semantic graph when generating the next word. Visual features describe low-level cues such as salient regions and the spatial layout of objects, whereas the semantic graph captures high-level features such as object attributes and inter-object relations, which align more closely with natural-language sentences. At the same time, the semantic graph contains some interfering information, so it makes sense for the model to choose between visual and semantic information when generating a sentence. Specifically, the sentinel mechanism consists of a gating unit on the language-model LSTM and an adaptive attention module. The gating unit computes a sentinel vector from the semantic information together with the previous word and memory cell of the LSTM; the adaptive attention module then decides, through a sentinel gate, whether the visual information or the sentinel vector participates in generating the sentence.

(3) We conduct extensive experiments to evaluate the semantic graph and the sentinel mechanism. Ablation studies assess the effectiveness of the three node types in the semantic graph and of the semantic sentinel mechanism in image captioning, and the overall performance of the proposed method is evaluated on the MSCOCO dataset.
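The graph construction in (1) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the detector outputs are hard-coded placeholders (standing in for Faster R-CNN, the MLP attribute detector, and the Bi-LSTM relationship detector), and the embedding dimension and weight initialization are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding dimension (assumed for illustration)

# Placeholder detector outputs for one image.
objects = ["dog", "frisbee"]
attributes = {"dog": ["brown"]}            # object -> attribute words
relations = [("dog", "catch", "frisbee")]  # (subject, predicate, object) triples

# One embedding per word, standing in for the trainable embedding layer.
nodes = objects \
        + [a for attrs in attributes.values() for a in attrs] \
        + [r[1] for r in relations]
idx = {w: i for i, w in enumerate(nodes)}
H = rng.standard_normal((len(nodes), DIM))  # node feature matrix

# Adjacency matrix from the three-tuples (object, relation, object)
# and two-tuples (object, attribute), plus self-loops.
A = np.eye(len(nodes))
for s, p, o in relations:            # object -- relation -- object
    A[idx[s], idx[p]] = A[idx[p], idx[s]] = 1
    A[idx[p], idx[o]] = A[idx[o], idx[p]] = 1
for o, attrs in attributes.items():  # object -- attribute
    for a in attrs:
        A[idx[o], idx[a]] = A[idx[a], idx[o]] = 1

# One GCN layer, H' = ReLU(D^{-1/2} A D^{-1/2} H W), to enhance
# each node's representation with its graph neighborhood.
W = rng.standard_normal((DIM, DIM))
d = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(d, d))
H_out = np.maximum(A_hat @ H @ W, 0.0)
print(H_out.shape)  # one enhanced embedding per node
```

After this layer, each object node's embedding also reflects its attributes and its relations to other objects, which is what the caption decoder consumes.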
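The sentinel gating and adaptive attention in (2) can be sketched as follows. All dimensions, weight matrices, and variable names here are illustrative assumptions; the LSTM states are random placeholders rather than outputs of a trained language model.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8   # hidden size (assumed)
K = 5   # number of visual regions (assumed)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder LSTM quantities at decoding step t.
x_t = rng.standard_normal(D)     # input: previous word + semantic information
h_prev = rng.standard_normal(D)  # previous hidden state
c_t = rng.standard_normal(D)     # current memory cell

# Gating unit: g_t = sigmoid(W_x x_t + W_h h_{t-1}),
# sentinel vector s_t = g_t * tanh(c_t).
W_x = rng.standard_normal((D, D))
W_h = rng.standard_normal((D, D))
g_t = sigmoid(W_x @ x_t + W_h @ h_prev)
s_t = g_t * np.tanh(c_t)

# Adaptive attention: softmax over the K visual regions plus the sentinel.
V = rng.standard_normal((K, D))  # visual region features
scores = np.concatenate([V @ h_prev, [s_t @ h_prev]])
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
beta = alpha[-1]  # sentinel gate: weight placed on semantic information

# Context vector mixing visual evidence and the semantic sentinel.
ctx = alpha[:-1] @ V + beta * s_t
```

A `beta` near 1 means the next word is generated mainly from the semantic graph side, while a `beta` near 0 keeps the model grounded in the visual regions.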