
Research On Visual Question Answering Method With Attention Reasoning Mechanism

Posted on: 2023-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Han
Full Text: PDF
GTID: 2558306941997029
Subject: Control Science and Engineering

Abstract/Summary:
With the rapid development of computer vision and natural language processing, artificial intelligence has entered a new wave of progress. In recent years, cross-modal tasks combining vision and language have attracted wide interest from researchers. Among them, visual question answering (VQA) is a challenging cross-modal fusion task: a model must fully understand both the image and the natural-language question, semantically align the features captured from the two modalities, and finally perform fine-grained fusion reasoning and answer prediction. Researchers have applied attention mechanisms to the VQA task with considerable success and achieved breakthrough progress; during modal fusion, however, missing key information or weak reasoning ability often leads to incorrect answer predictions. The VQA task therefore requires finer-grained extraction of image and question features, together with fine-grained reasoning. To address these problems, this thesis proposes three novel model frameworks and makes several improvements to attention-mechanism models. The main research contents are as follows.

Firstly, visual question answering based on a deep interactive attention network is studied. In previous studies, researchers typically used an LSTM or GRU network to encode the question; such word-level encoding can lose key information in the question. At the same time, image features and question features are not well processed during modal fusion, which introduces considerable noise. This thesis therefore proposes a deep interactive attention network that uses self-attention units to process feature information and perform semantic feature alignment, while an improved one-dimensional convolutional network further captures phrase-level features in the question. The model can effectively extract and filter feature information and conduct deep interaction, which improves its reasoning ability and the accuracy of answer prediction.

Secondly, visual question answering based on a deep reasoning attention network is studied. Effective feature information is a prerequisite for improving reasoning ability. A deep reasoning attention network is proposed using the Transformer model and a variant of it (the Transformer model improved with the residual idea). This network uses the Transformer to encode the question and capture complex relationships between words, while a memory network stores key information from the image and the question, further assisting the model's reasoning during modal fusion. The deep reasoning attention model effectively combines the attention mechanism, memory, and reasoning, enabling the model to carry out deeper inference.

Thirdly, visual question answering based on a relational reasoning attention network is studied. In the two works above, the model's reasoning ability on complex questions remains weak, and the rich semantic and spatial information in the image is ignored. This thesis therefore proposes a relational reasoning attention unit. The Faster R-CNN model extracts object features and candidate-box features from the image, and the candidate-box features are treated as a separate modality. The three modalities (question features, object features, and candidate-box features) are fed simultaneously into the improved Transformer model, and higher-level information in the image is captured through deep interaction, enabling fine-grained reasoning and answer prediction.

Test results on benchmark datasets show that the three proposed attentional reasoning network models capture image features and question features well and perform fine-grained fusion reasoning, which further verifies the models' effectiveness.
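The first model combines self-attention units with a one-dimensional convolution over the question to capture phrase-level features. The following is a minimal NumPy sketch of that idea, not the thesis's implementation: `phrase_conv1d` is a hypothetical mean-pooled sliding window standing in for the improved convolutional network, and all dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Scaled dot-product self-attention over a feature sequence X (n, d)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)        # pairwise relevance between positions
    return softmax(scores, axis=-1) @ X  # each row is a relevance-weighted mix

def phrase_conv1d(E, k=3):
    """Hypothetical 1-D convolution (mean-pooled window of width k) over word
    embeddings E (n_words, d), forming phrase-level question features."""
    n, _ = E.shape
    pad = k // 2
    Ep = np.pad(E, ((pad, pad), (0, 0)))           # zero-pad the sequence ends
    return np.stack([Ep[i:i + k].mean(axis=0) for i in range(n)])

# Toy question of 5 "words" with 8-dimensional embeddings.
E = np.random.default_rng(0).normal(size=(5, 8))
phrases = phrase_conv1d(E)          # phrase-level features, same sequence length
aligned = self_attention(phrases)   # self-attention refines and aligns them
print(aligned.shape)                # (5, 8)
```

In the thesis's setting the same self-attention unit would also process image-region features, so that the two modalities can be semantically aligned before fusion.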
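The third model feeds three modalities (question, object, and candidate-box features) into a shared Transformer for joint reasoning. The sketch below illustrates that fusion pattern only roughly, assuming a single attention layer over the concatenated token sequence; the function name, pooling choice, and dimensions are all illustrative, not from the thesis.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_modalities(Q, V, B):
    """Jointly attend over question (Q), object (V), and candidate-box (B)
    features by treating each as tokens in one shared sequence, a rough
    analogue of feeding three modalities into one Transformer layer."""
    X = np.concatenate([Q, V, B], axis=0)        # (n_q + n_v + n_b, d)
    d = X.shape[-1]
    A = softmax(X @ X.T / np.sqrt(d), axis=-1)   # cross-modal attention map
    fused = A @ X                                # every token sees all modalities
    return fused[:Q.shape[0]].mean(axis=0)       # pooled question-side summary

rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 16))   # 4 question tokens
V = rng.normal(size=(6, 16))   # 6 detected objects (e.g. Faster R-CNN features)
B = rng.normal(size=(6, 16))   # matching candidate-box (geometry) features
summary = fuse_modalities(Q, V, B)
print(summary.shape)           # (16,)
```

Concatenating the modalities into one sequence lets every object token attend to box geometry and question words in the same attention map, which is one simple way such a unit can exploit the spatial information the abstract says earlier models ignored.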
Keywords/Search Tags:Visual Question Answering, Attention Mechanism, Deep Reasoning, Relation Reasoning, Transformer Model