
Research On Multimodal Attention Mechanism And Information Fusion For Visual Question Answering

Posted on: 2020-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: M R Lao
Full Text: PDF
GTID: 2518306548995539
Subject: Management Science and Engineering
Abstract/Summary:
As a cutting-edge task in multi-modal information processing, visual question answering (VQA) requires a joint understanding of visual and textual information to infer the correct answer for each image-question pair. Research on this task is of great significance for advancing multi-modal deep learning theory, especially multi-modal attention mechanisms and information fusion. At the same time, a deep learning framework that answers visual questions with high accuracy has broad prospects for application in civilian and military fields.

This thesis first elaborates the research background and significance of the visual question answering task and introduces its basic definition and content in detail. It then studies the two core components of the task: the multi-modal attention mechanism and multi-modal information fusion. While systematically reviewing the related techniques, it designs several novel and efficient methods. The main contents and innovations of this thesis are as follows:

(1) For the multi-modal attention mechanism, this thesis proposes a novel multi-modal co-attention mechanism: a two-way attention mechanism that combines a textual attention mechanism guided by the overall semantics with a question-guided visual attention mechanism. After introducing the background and basic definition of the method, the thesis demonstrates its effectiveness through quantitative experiments and qualitative visualization of the attention maps on images.

(2) For multi-modal information fusion, this thesis takes multi-step fusion as its design principle and proposes a cross-modal multi-step fusion network. On the one hand, the cross-modal multi-step fusion unit achieves efficient word-level information fusion; on the other hand, benefiting from the network's cyclic residual structure, the method can increase the number of visual-textual fusion steps without a linear increase in learnable parameters. Experiments show that, combined with the attention mechanism, the resulting learning framework achieves high answering accuracy.

(3) To preserve the second-order interaction of bilinear pooling while reducing its computational cost, a bilinear pooling method based on multi-modal local perception is proposed. The second-order interaction between two high-dimensional features in the original pooling method is transformed into a bilinear pooling operation between a feature kernel of one modality and the local features of the other modality. At the same time, the method shares weights across the local bilinear pooling operations. With this design, the method captures complex interactions between visual and textual information while saving substantial computational resources. Experimental analysis shows that the method indeed achieves efficient fusion of multi-modal information.
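As a concrete illustration of contribution (1), the following is a minimal PyTorch sketch of question-guided visual attention, one direction of the proposed two-way co-attention. The feature dimensions, layer choices, and variable names are illustrative assumptions, not the thesis's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedVisualAttention(nn.Module):
    """Sketch of question-guided visual attention (assumed dimensions)."""
    def __init__(self, v_dim=2048, q_dim=1024, hid_dim=512):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hid_dim)   # project image region features
        self.q_proj = nn.Linear(q_dim, hid_dim)   # project the question encoding
        self.score = nn.Linear(hid_dim, 1)        # scalar attention logit per region

    def forward(self, v, q):
        # v: [B, K, v_dim] region features, q: [B, q_dim] question vector
        joint = torch.tanh(self.v_proj(v) + self.q_proj(q).unsqueeze(1))  # [B, K, hid]
        alpha = F.softmax(self.score(joint), dim=1)                       # weights over K regions
        return (alpha * v).sum(dim=1)                                     # attended visual feature [B, v_dim]
```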
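For contribution (2), the sketch below illustrates the multi-step fusion idea: a single fusion unit is reused across several steps with a residual connection, so adding fusion rounds does not add parameters linearly. The gated element-wise fusion unit and the dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CrossModalMultiStepFusion(nn.Module):
    """Sketch of multi-step fusion with a cyclic residual structure (assumed design)."""
    def __init__(self, dim=1024, steps=4):
        super().__init__()
        self.steps = steps
        self.v_gate = nn.Linear(dim, dim)   # shared across all steps
        self.q_gate = nn.Linear(dim, dim)   # shared across all steps

    def forward(self, v, q):
        # v, q: [B, dim] attended visual / question features
        h = v * q                                                        # initial joint representation
        for _ in range(self.steps):
            update = torch.tanh(self.v_gate(v)) * torch.sigmoid(self.q_gate(q)) * h
            h = h + update                                               # residual update, parameters reused
        return h
```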
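For contribution (3), the following sketch shows a low-rank, locally pooled bilinear interaction with shared projection weights, in the spirit of the described local-perception bilinear pooling. The factor size, output dimension, and normalisation steps are assumptions, not the thesis's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalBilinearPooling(nn.Module):
    """Sketch of locally pooled bilinear fusion with shared weights (assumed sizes)."""
    def __init__(self, v_dim=2048, q_dim=1024, out_dim=1000, factor=5):
        super().__init__()
        self.factor = factor
        self.v_proj = nn.Linear(v_dim, out_dim * factor)  # one shared projection per modality
        self.q_proj = nn.Linear(q_dim, out_dim * factor)

    def forward(self, v, q):
        # v: [B, v_dim], q: [B, q_dim]
        joint = self.v_proj(v) * self.q_proj(q)                          # element-wise (bilinear) interaction
        joint = joint.view(joint.size(0), -1, self.factor).sum(dim=2)    # sum-pool each local factor group
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)  # signed square root
        return F.normalize(joint, dim=1)                                 # L2 normalisation
```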
Keywords/Search Tags:Visual Question Answering, Deep Learning, Information Fusion, Attention Mechanism