| Image captioning requires generating a fluent text description that is highly relevant to the image and conforms to human habits of expression, based on a thorough analysis and understanding of the image content; this understanding is the key to improving the quality of the description. Understanding the image content requires, on the one hand, feeding high-quality visual features into the model. Visual features are divided into region features and grid features. Region features provide high-level semantic information about objects but cannot cover the image background or the fine-grained details of the target; grid features make up for these shortcomings. On the other hand, these visual features need to be fully extracted, fused, and utilized. Thanks to the self-attention mechanism in the Transformer, the semantic information in visual features can be mined through successive layers of self-attention. Existing image captioning models nevertheless have shortcomings in understanding image content. First, owing to defects of the self-attention mechanism, they cannot make full use of visual features. Second, current non-autoregressive image captioning models use only one type of visual feature and do not integrate the two types so that their advantages complement each other. To address these problems, this thesis carries out the following work. (1) To address the defects of the self-attention mechanism, this thesis proposes a bilinear multilevel self-attention mechanism, which comprises an improved bilinear self-attention mechanism and a multilevel attention mechanism. The improved bilinear self-attention not only explores and exploits higher-order features in the input data to strengthen multimodal reasoning, but also mitigates the internal covariate shift caused by stacking many layers. In addition, the multilevel self-attention mechanism addresses the problem that forcing attention weights between features can mislead model reasoning. (2) To address the use of only one visual feature in non-autoregressive image captioning models, this thesis improves the model structure by adding grid features, which provide contextual and background information as well as fine-grained information about the target. To enable the model to represent relative location information, this thesis proposes a feature relation fusion module that directly integrates relative geometric relations into the visual features. Quantitative experiments on the MS COCO dataset demonstrate that the proposed modules are usable and effective. |
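
The abstract does not spell out how relative geometric relations are computed, so the following is only a minimal sketch of one commonly used encoding: log-scaled center offsets and size ratios between two boxes. The function names `relative_geometry` and `pairwise_geometry`, the `eps` clamp, and the corner-format box layout are assumptions made for illustration and may differ from the thesis's feature relation fusion module. Because a grid cell can be treated as a box, the same function could in principle be applied to both region and grid features.

```python
import math


def relative_geometry(box_a, box_b, eps=1e-3):
    """Return a 4-d relative geometry vector of box_b with respect to box_a.

    Boxes are (x_min, y_min, x_max, y_max) with positive width and height.
    This layout and the eps clamp are illustrative assumptions.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    aw, ah = ax1 - ax0, ay1 - ay0
    bw, bh = bx1 - bx0, by1 - by0
    acx, acy = ax0 + aw / 2, ay0 + ah / 2
    bcx, bcy = bx0 + bw / 2, by0 + bh / 2
    # Center offsets are normalized by the size of box_a and clamped
    # at eps so that coincident centers do not produce log(0).
    dx = math.log(max(abs(acx - bcx) / aw, eps))
    dy = math.log(max(abs(acy - bcy) / ah, eps))
    # Size ratios are scale-invariant after the log.
    dw = math.log(bw / aw)
    dh = math.log(bh / ah)
    return (dx, dy, dw, dh)


def pairwise_geometry(boxes):
    """Build the N x N table of relative geometry vectors for all box pairs."""
    return [[relative_geometry(a, b) for b in boxes] for a in boxes]
```

In a fusion module of the kind described, such a pairwise table would typically be embedded and combined with the visual features or attention scores; that step is omitted here because the abstract gives no details about it.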