
Research On Theory And Method Of Key Problems In Visual Question Answering

Posted on: 2020-01-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Y Zhou
Full Text: PDF
GTID: 1488305723483764
Subject: Computer Science and Technology
Abstract/Summary:
Visual Question Answering (VQA) is an emerging research hotspot in the field of Artificial Intelligence. It refers to the task of answering natural language questions about given images. The skills these questions demand include object recognition, scene understanding, counting, reasoning, and so on. Unlike traditional tasks in computer vision or natural language processing, VQA requires a model to thoroughly understand both visual and textual information and to perform visual reasoning to infer the answer. For these reasons, VQA is often regarded as an ultimate AI task. (A minimal sketch of the generic VQA pipeline follows the chapter list below.)

The main purpose of this thesis is to investigate the theory and methods behind open problems in VQA. The first chapter introduces the development of VQA by presenting existing datasets and describing representative models from its different research directions, allowing the reader to quickly grasp the principles and progress of the field. We also summarize the main issues that hinder the advancement of VQA, including: the interpretability of model predictions, strong language priors in datasets, the compactness of VQA models, the limited receptive field of visual features, and the reasoning ability of VQA models. Based on these issues, the following chapters present a series of studies:

· To address the interpretability of model predictions, we propose a new multi-task neural network architecture in Chapter Two. This network uses a neural pivot structure to combine VQA with image captioning (IC). The question-aware caption module shares its visual backbone with the VQA branch and can generate a short sentence to explain the model's prediction.

· To address the compactness of VQA models, we propose a novel attention mechanism in Chapter Three, termed Dynamic Capsule Attention (CapsAtt). CapsAtt can replace the traditional stacked attention structure and perform multi-step visual reasoning with only one attention layer (see the routing sketch after this list). This design greatly reduces the parameter size of VQA models while maintaining their performance.

· To address the strong language priors in VQA datasets, we propose a novel learning scheme called Pairwise Inconformity Learning (PIL) in Chapter Four. PIL exploits the image-pair setting of the VQA2.0 dataset and introduces novel designs, e.g., a multi-modal embedding space and a dynamic-margin triplet loss (see the loss sketch after this list), to force the model to answer questions based more on visual information.

· To address the limited receptive field of visual features, we propose a novel Multi-modal Pyramid Network (MPN) in Chapter Five. Compared with previous VQA methods, MPN uses a convolutional feature pyramid to perceive potential answer entities at multiple scales, which in turn boosts model performance. MPN is also equipped with a novel Top-down Reasoning Scheme to maintain semantic consistency among the joint features learned at different scales.

· To improve the reasoning ability of VQA models on natural images, we propose a Triangular Graph Neural Network (TriGraph) in Chapter Six that helps the model learn question-aware visual relationships, thereby grounding the relational conditions expressed in questions. The proposed Semantic Normalized Graph Layer (SNG-Layer) is also compatible with most graph neural networks.
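To make the generic pipeline concrete, the following is a minimal sketch of a VQA model that fuses pooled image features with an encoded question and classifies over a fixed answer vocabulary. All names and sizes (SimpleVQA, img_dim=2048, a GRU question encoder) are illustrative assumptions, not the architecture used in this thesis.

import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    # Fuse pooled image features with an encoded question, then classify
    # over a fixed answer vocabulary.
    def __init__(self, vocab_size, num_answers, img_dim=2048, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)         # word embeddings
        self.rnn = nn.GRU(300, hid_dim, batch_first=True)  # question encoder
        self.img_proj = nn.Linear(img_dim, hid_dim)        # visual projection
        self.classifier = nn.Linear(hid_dim, num_answers)  # answer head

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_dim) pooled CNN features; question_ids: (B, T)
        _, q = self.rnn(self.embed(question_ids))          # q: (1, B, hid_dim)
        joint = torch.tanh(self.img_proj(img_feat)) * torch.tanh(q.squeeze(0))
        return self.classifier(joint)                      # answer logits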
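The CapsAtt contribution above describes multi-step reasoning inside a single attention layer. The sketch below shows one plausible way to realize that with capsule-style dynamic routing, where the attention weights are refined over a few iterations instead of across stacked layers; it follows the abstract's description only, not the chapter's exact formulation, and all names (DynamicRoutingAttention, num_iters) are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicRoutingAttention(nn.Module):
    # One attention layer whose weights are refined over several routing
    # iterations, so multi-step reasoning needs no stacked attention layers.
    def __init__(self, img_dim, q_dim, hid_dim=512, num_iters=3):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hid_dim)
        self.q_proj = nn.Linear(q_dim, hid_dim)
        self.num_iters = num_iters

    def forward(self, img_feats, q_feat):
        # img_feats: (B, N, img_dim) region features; q_feat: (B, q_dim)
        v = self.img_proj(img_feats)                        # (B, N, H)
        q = self.q_proj(q_feat)                             # (B, H)
        logits = torch.zeros(v.shape[:2], device=v.device)  # routing logits
        for _ in range(self.num_iters):
            alpha = F.softmax(logits, dim=1)                # attention weights
            attended = (alpha.unsqueeze(-1) * v).sum(1)     # pooled (B, H)
            # Agreement between the pooled feature and each region updates
            # the routing logits, as in capsule dynamic routing.
            logits = logits + (v * (attended + q).unsqueeze(1)).sum(-1)
        return attended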
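For PIL, the abstract mentions a dynamic-margin triplet loss over VQA2.0 image pairs. The function below is one possible reading, in which the margin grows with how dissimilar the paired embeddings are; it is an illustrative assumption, not the thesis's actual definition.

import torch
import torch.nn.functional as F

def dynamic_margin_triplet_loss(anchor, positive, negative, base_margin=0.2):
    # Triplet loss whose margin scales with the dissimilarity between the
    # positive and negative embeddings: very different pairs demand a
    # larger separation. Detach so the margin itself is not optimized.
    margin = base_margin * (1.0 - F.cosine_similarity(positive, negative)).detach()
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared L2 to positive
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # squared L2 to negative
    return F.relu(d_pos - d_neg + margin).mean()

Detaching the margin keeps it a per-pair constant during backpropagation, so gradients only push the anchor toward the positive and away from the negative.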
Keywords/Search Tags:Visual Question Answering, Attention Mechanism, Visual Reasoning