
Research On Visual Question Answering Method Based On Deep Learning

Posted on: 2019-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: L F Cao
Full Text: PDF
GTID: 2348330563953958
Subject: Computer software and theory

Abstract/Summary:
With the rapid development of computer vision and natural language processing, visual question answering (VQA) has become an increasingly popular research field in deep learning. Text-based question answering systems have been studied extensively in natural language processing and have achieved remarkable results, whereas question answering grounded in visual content remains far less explored. VQA is an interdisciplinary research field whose main purpose is to automatically answer a natural language question about visual content (an image or a video), and it is one of the key future research directions in artificial intelligence. By simulating real-world scenarios, a VQA system can help visually impaired users perform real-time human-computer interaction, which points to its practical future. VQA was originally inspired by the Turing test, and the study of VQA based on deep learning has emerged as a hot topic only in recent years. Deep learning research in general has drawn increasing attention, for example large-scale image retrieval based on deep hashing, which can quickly find similar images among millions of candidates. As a new research direction, deep-learning-based VQA still leaves much to be learned and explored, and the challenges it poses continue to grow.

A number of recent studies focus on attention mechanisms such as visual attention ("where to look") or question attention ("what words to listen to"), and these have proved effective for VQA. However, they concentrate on modeling the prediction error while ignoring the semantic correlation between image attention and question attention, which inevitably leads to suboptimal attentions. In this thesis, we argue that in addition to modeling visual and question attentions, it is equally important to model their semantic correlation so that the two attentions are learned jointly and their joint representation learning is facilitated. We therefore propose a novel end-to-end model that jointly learns attentions with semantic cross-modal correlation to solve the VQA problem efficiently. Specifically, we propose a multi-modal embedding that maps the visual and question attentions into a joint space to guarantee their semantic consistency. Experimental results on benchmark datasets demonstrate that our model outperforms several state-of-the-art techniques in visual question answering.

In addition, existing approaches predominantly predict the answer from the question and the whole image without considering the leading role of the question, and recent spatial reasoning is usually conducted at the pixel level rather than the object level. We therefore propose a simple yet novel framework, Question-Led Object Attention (QLOB), which improves VQA performance by exploiting question semantics, fine-grained object information, and the relationship between the two modalities. First, we extract sentence semantics with a question model and use an efficient object detection network to obtain a global visual feature and local features from the top r object region proposals. Second, the QLOB attention mechanism selects the question-related object regions. Third, we optimize the question model and the QLOB attention with a softmax classifier to predict the final answer. Extensive experiments on three public VQA datasets demonstrate that QLOB outperforms the state of the art.
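To make the joint-attention idea concrete, the following is a minimal PyTorch-style sketch, not the thesis's exact architecture: visual and question attentions are computed separately, the two attended vectors are projected by a multi-modal embedding into a shared space, and a cosine-based consistency term encourages their semantic agreement. All layer sizes, module names, and the particular choice of consistency loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionVQA(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, joint_dim=512, num_answers=3000):
        super().__init__()
        self.img_att = nn.Linear(img_dim, 1)            # scores each image region
        self.q_att = nn.Linear(q_dim, 1)                # scores each question word
        self.img_embed = nn.Linear(img_dim, joint_dim)  # multi-modal embedding, image side
        self.q_embed = nn.Linear(q_dim, joint_dim)      # multi-modal embedding, question side
        self.classifier = nn.Linear(joint_dim, num_answers)

    def forward(self, img_feats, q_feats):
        # img_feats: (B, R, img_dim) region features; q_feats: (B, T, q_dim) word features
        img_w = F.softmax(self.img_att(img_feats), dim=1)  # visual attention ("where to look")
        q_w = F.softmax(self.q_att(q_feats), dim=1)        # question attention ("what words to listen to")
        v = (img_w * img_feats).sum(dim=1)                 # attended image vector
        q = (q_w * q_feats).sum(dim=1)                     # attended question vector
        v_j = F.normalize(self.img_embed(v), dim=-1)       # joint-space image embedding
        q_j = F.normalize(self.q_embed(q), dim=-1)         # joint-space question embedding
        logits = self.classifier(v_j * q_j)                # answer prediction from the fused embedding
        consistency = 1.0 - F.cosine_similarity(v_j, q_j).mean()  # cross-modal consistency term
        return logits, consistency
```

During training the answer cross-entropy and the consistency term would be combined, for example loss = F.cross_entropy(logits, answers) + lambda * consistency, so the attentions are learned jointly rather than only from the prediction error.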
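Along the same lines, here is a minimal sketch of a question-led object attention in the spirit of QLOB, again under assumed details: a question sentence embedding scores the top r detected object regions, the question-weighted object features are fused with the global image feature, and a softmax classifier over candidate answers is trained on the fused vector. The module name, the additive scoring, the elementwise-product fusion, and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QLOBStyleAttention(nn.Module):
    def __init__(self, obj_dim=2048, q_dim=1024, hid=512, num_answers=3000):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hid)         # question sentence embedding -> hidden space
        self.obj_proj = nn.Linear(obj_dim, hid)     # per-object proposal features -> hidden space
        self.score = nn.Linear(hid, 1)              # question-led relevance score per object
        self.global_proj = nn.Linear(obj_dim, hid)  # global image feature -> hidden space
        self.classifier = nn.Linear(hid, num_answers)

    def forward(self, q_vec, obj_feats, global_feat):
        # q_vec: (B, q_dim); obj_feats: (B, r, obj_dim) top-r proposals; global_feat: (B, obj_dim)
        q = self.q_proj(q_vec).unsqueeze(1)                     # (B, 1, hid)
        o = self.obj_proj(obj_feats)                            # (B, r, hid)
        att = F.softmax(self.score(torch.tanh(q + o)), dim=1)   # question-led attention over objects
        attended = (att * o).sum(dim=1)                         # summary of question-related regions
        fused = attended * torch.tanh(self.global_proj(global_feat))  # fuse local and global evidence
        return self.classifier(fused)               # answer logits (softmax applied in the loss)
```

The object features would come from a detector's top r region proposals, and the answer is treated as a classification target, which matches the common practice in VQA pipelines.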
Keywords/Search Tags:Deep Learning, Computer Vision, Natural Language Processing, Visual Question Answering, Attention Mechanisms