
Research On Visual Question Answering Based On Deep Neural Network And Attention Mechanism

Posted on: 2019-01-08
Degree: Master
Type: Thesis
Country: China
Candidate: Q Li
Full Text: PDF
GTID: 2428330542994088
Subject: Information and Communication Engineering

Abstract/Summary:
Visual Question Answering (VQA) is a challenging task proposed to connect computer vision and natural language processing. Given an image and a textual question in natural language, the task requires reasoning over the visual content of the image and common-sense knowledge to infer the correct answer. A machine must therefore be equipped with cross-modal understanding across language and vision, which is far more demanding than single-modality tasks such as image recognition or document classification. The importance of VQA is multi-fold. From the perspective of computer vision (CV), VQA is naturally the next step towards a full understanding of visual media such as images and videos, following the tasks of image/video recognition and captioning. From the point of view of natural language processing (NLP), connecting to the visual world is essential for truly understanding human language. Both CV and NLP belong to the scope of artificial intelligence (AI), but they have developed separately for historical reasons. VQA is considered a milestone in the fusion of these two research fields, on the way towards complete and general AI.

VQA has received much research attention in recent years from both the computer vision and natural language processing communities. Most existing approaches adopt the pipeline of representing an image via a pre-trained convolutional neural network (CNN) and then using the uninterpretable CNN features, in conjunction with a question embedding from a recurrent neural network (RNN), to predict the answer. Although such end-to-end models may report promising performance, they rarely provide any insight into the VQA process beyond the answer itself. We therefore propose to break end-to-end VQA into two steps, explaining and reasoning, in an attempt towards a more explainable VQA that sheds light on the intermediate results between these two steps. Our system achieves performance comparable with the state of the art, with the added benefits of explainability and the inherent ability to improve further given higher-quality explanations.

Moreover, most existing work in VQA is dedicated to improving the accuracy of predicted answers while disregarding explanations. We argue that the explanation for an answer is of the same or even greater importance than the answer itself, since it makes the question-answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), in which computational models are required to generate an explanation along with the predicted answer. We first construct a new dataset and then frame the VQA-E problem in a multi-task learning architecture. We have conducted a user study to validate the quality of the explanations synthesized by our method. We show quantitatively that the additional supervision from explanations not only produces insightful textual sentences to justify the answers but also improves the performance of answer prediction. Our model outperforms state-of-the-art methods by a clear margin on a popular VQA dataset.
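The standard pipeline the abstract describes (pre-trained CNN image features fused with an RNN question embedding via attention to predict an answer) can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the thesis's exact model; the module names, layer sizes, and single-glimpse attention are all assumptions.

```python
# Minimal sketch of a common attention-based VQA pipeline: pre-trained CNN
# region features + RNN question embedding -> attention fusion -> answer
# classifier. All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=300,
                 hidden_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Scores one attention weight per image region, conditioned on the question.
        self.att = nn.Linear(img_dim + hidden_dim, 1)
        self.fuse = nn.Linear(img_dim + hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feats, question):
        # img_feats: (B, R, img_dim) region features from a pre-trained CNN
        # question:  (B, T) token ids
        _, q = self.rnn(self.embed(question))      # final hidden state (1, B, H)
        q = q.squeeze(0)                           # (B, H)
        q_tiled = q.unsqueeze(1).expand(-1, img_feats.size(1), -1)
        scores = self.att(torch.cat([img_feats, q_tiled], dim=-1))  # (B, R, 1)
        alpha = F.softmax(scores, dim=1)           # attention over regions
        v = (alpha * img_feats).sum(dim=1)         # attended image vector (B, img_dim)
        fused = torch.tanh(self.fuse(torch.cat([v, q], dim=-1)))
        return self.classifier(fused)              # answer logits (B, num_answers)
```

Question-guided attention of this form is what allows the model to ground its prediction in specific image regions rather than a single global feature vector.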
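In the same spirit, the multi-task formulation of VQA-E (a shared image-question representation feeding both an answer classifier and an explanation decoder, trained with a joint loss) could look roughly like the sketch below. The decoder design, teacher-forcing setup, and the loss weight `lambda_exp` are hypothetical details, not taken from the thesis.

```python
# Hedged sketch of a multi-task VQA-E head: the shared fused vector drives
# both answer prediction and explanation generation, with a weighted sum of
# the two losses. Names and the weighting scheme are illustrative assumptions.
import torch
import torch.nn as nn

class VQAEHead(nn.Module):
    def __init__(self, hidden_dim, num_answers, vocab_size):
        super().__init__()
        self.answer_head = nn.Linear(hidden_dim, num_answers)
        # Explanation decoder: a GRU language model initialized with the fused vector.
        self.word_embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.word_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, fused, expl_tokens):
        # fused:       (B, hidden_dim) shared image-question representation
        # expl_tokens: (B, T) explanation tokens, shifted for teacher forcing
        answer_logits = self.answer_head(fused)
        h0 = fused.unsqueeze(0)                    # decoder initial state (1, B, H)
        out, _ = self.decoder(self.word_embed(expl_tokens), h0)
        expl_logits = self.word_out(out)           # (B, T, vocab_size)
        return answer_logits, expl_logits

def vqa_e_loss(answer_logits, answers, expl_logits, expl_targets, lambda_exp=1.0):
    # Multi-task objective: answer classification plus explanation generation.
    ce = nn.CrossEntropyLoss()
    l_ans = ce(answer_logits, answers)
    l_exp = ce(expl_logits.reshape(-1, expl_logits.size(-1)),
               expl_targets.reshape(-1))
    return l_ans + lambda_exp * l_exp
```

The key design choice is that the explanation loss acts as additional supervision on the shared representation, which is consistent with the abstract's finding that explanation training also improves answer accuracy.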
Keywords/Search Tags:Visual Question Answering, Convolutional Neural Network, Recurrent Neural Network, Attention Mechanism, Explainable Model