| Image captioning is the task of automatically generating a meaningful text description for an input image. Because it lies at the intersection of computer vision and natural language processing and has broad application prospects, it has attracted a growing number of researchers and become a research hotspot in recent years. Scene graphs annotate the semantic relationships between objects in an image. By generating a scene graph for the image, relationships between objects can be introduced into the captioning model to enhance region-level features, which helps the model reason its way to a correct description. However, existing scene graph generation models inevitably predict a large number of redundant and noisy relations, which severely harms the image captioning task. To make effective use of the semantic relations in the scene graph that benefit the generated description while reducing interference from noisy relations, this paper proposes a gated graph attention encoder, applied after the scene semantic graph of the image is constructed. The encoder combines an attention mechanism with a gating mechanism to automatically focus on the relations that are useful for generating descriptions and aggregates them into relation-aware region-level features. Specifically, the attention mechanism assigns weights to the set of input relations to distinguish useful relations from useless ones. The gating mechanism then re-evaluates the value of each attended relation, reducing the impact of redundant relations on description generation. In addition, a global adaptive attention module is designed for the decoder; it makes joint use of global and region-level features to guide description generation. Finally, extensive experiments are carried out on MS-COCO, a popular benchmark dataset for image description generation. The results show that the proposed model outperforms the latest methods that introduce semantic relations to guide image description generation, and ablation experiments verify the effectiveness of each module. |
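
The abstract does not give the encoder's equations, so the following is only a minimal NumPy sketch of the general idea it describes: attention weights over a region's relations, followed by a gate that rescales each attended relation before aggregation. The function name `gated_graph_attention`, the parameter matrices `W_q`, `W_k`, `W_v`, `W_g`, the additive region-plus-relation message, and the residual connection are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_graph_attention(regions, relations, adjacency, params):
    """Hypothetical single-layer gated graph attention (illustrative only).

    regions:   (N, d) region-level features
    relations: (N, N, d) embedding of the relation on edge i -> j
    adjacency: (N, N) bool, True where the scene graph has an edge i -> j
    params:    (W_q, W_k, W_v, W_g) with shapes (d, d), (d, d), (d, d), (2d,)
    """
    W_q, W_k, W_v, W_g = params
    N, d = regions.shape

    # Assumed message on edge i -> j: neighbour feature plus relation embedding.
    messages = regions[None, :, :] + relations            # (N, N, d)

    # Attention: score each relation of region i, softmax over its neighbours.
    queries = regions @ W_q                                # (N, d)
    keys = messages @ W_k                                  # (N, N, d)
    scores = np.einsum("id,ijd->ij", queries, keys) / np.sqrt(d)
    scores = np.where(adjacency, scores, -1e9)             # mask missing edges
    attn = softmax(scores, axis=1)
    attn = np.where(adjacency, attn, 0.0)                  # isolated rows become all zero

    # Gate: a scalar in [0, 1] per edge that re-evaluates the attended relation.
    centre = np.broadcast_to(regions[:, None, :], (N, N, d))
    pair = np.concatenate([centre, messages], axis=-1)     # (N, N, 2d)
    gate = sigmoid(pair @ W_g)                             # (N, N)

    # Aggregate gated, attended messages; residual keeps the original feature.
    values = messages @ W_v                                # (N, N, d)
    aggregated = np.einsum("ij,ijd->id", attn * gate, values)
    return regions + aggregated, attn, gate
```

In this sketch a relation that receives a high attention weight can still be suppressed if its gate value is close to zero, which is one plausible reading of how the gate reduces the influence of redundant relations. A region with no outgoing edges keeps its original feature unchanged.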