
Relation-based Visual Question Answering

Posted on: 2021-02-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C F Wu
GTID: 1368330605981258
Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of Internet technology, multimedia data is exploding. In this data, a single medium rarely exists in isolation; different media are naturally symbiotic and semantically related, and are therefore called "cross-media" data. Exploring the semantic associations within cross-media data and improving its semantic understanding and reasoning, and thereby people's ability to manage and leverage such data, is both practically valuable and scientifically challenging. Visual Question Answering (VQA) is a typical cross-media analysis and reasoning task: it takes vision and language, two typical media forms, as input and produces easy-to-evaluate answers as output. Since VQA requires a machine to simultaneously represent and understand vision and language, and to combine both to reason, it has been called a "Visual Turing Machine" and an "AI-complete" problem. This dissertation is based on an in-depth study of the difficulties of the VQA task and an extensive analysis of existing research. The main research results are as follows:

A feature-relation based differential fusion model for VQA is proposed. By mapping both visual and linguistic features into a differential modality, cross-modal information can be better represented. First, a Differential Network (DN) is proposed, which maps the features of each modality into the differential modality. Second, a DN-based Fusion (DF) is proposed to model the feature interaction between the differential modalities (a sketch follows below). Experiments on public datasets show that differential fusion outperforms existing linear and nonlinear fusion methods and reduces the distance between the modalities.

An object-relation based comparative attention model for VQA is proposed. Through pairwise comparison between objects, cross-media information can be better filtered. First, an Object Difference Attention (ODA) is proposed: the differences between objects are obtained through a difference operation, and this difference information is then used to select the visual objects that are useful for answering the question (a sketch follows below). Second, ODA is extended to a more general Comparative Attention (CA), for which four CA kernels are proposed. Experiments on public datasets show that comparative attention outperforms existing non-comparative attention methods, and that different comparison kernels are good at answering different types of questions.

A high-order relation based chained reasoning model for VQA is proposed. By iteratively generating new objects and new relations, cross-media information can better support answer decisions. A Relational Reasoning (RR) module computes compound relations between objects, and an Object Refining (OR) module refines the new relations into new compound objects. Based on these two modules, a Chain of Reasoning (CoR) is proposed: through iterative relational reasoning and object refining, the answer to the question is gradually inferred (a sketch follows below). Experiments on public datasets show that the chain structure outperforms existing parallel and stacked structures, and that the intermediate results of the reasoning are interpretable.
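The abstract does not give the exact form of the differential mapping, so the following is a minimal PyTorch sketch of the idea, assuming the differential operation can be modeled as the difference of two learned linear projections and the fusion as an element-wise product in the shared differential space. All class and parameter names here are illustrative, not the dissertation's.

```python
import torch
import torch.nn as nn

class DifferentialNetwork(nn.Module):
    """Map an input feature into the 'differential' space as the
    difference of two learned linear projections (an assumption)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.left = nn.Linear(in_dim, out_dim)
        self.right = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.left(x) - self.right(x)

class DifferentialFusion(nn.Module):
    """Fuse visual and linguistic features in the shared differential
    space via an element-wise product, then predict an answer."""
    def __init__(self, v_dim: int, q_dim: int, hid_dim: int, n_answers: int):
        super().__init__()
        self.dn_v = DifferentialNetwork(v_dim, hid_dim)
        self.dn_q = DifferentialNetwork(q_dim, hid_dim)
        self.classifier = nn.Linear(hid_dim, n_answers)

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.dn_v(v) * self.dn_q(q))

# Toy usage: a batch of 2 with 2048-d image and 1024-d question features.
model = DifferentialFusion(2048, 1024, 512, 3000)
logits = model(torch.randn(2, 2048), torch.randn(2, 1024))
print(logits.shape)  # torch.Size([2, 3000])
```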
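Likewise, a minimal sketch of object-difference attention, assuming each object's attention score is derived from its pairwise differences with every other object, modulated by the question representation; the projection and aggregation choices below are assumptions, not the dissertation's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectDifferenceAttention(nn.Module):
    """Question-guided attention over detected objects, scored from
    pairwise object differences (a simplified, assumed formulation)."""
    def __init__(self, obj_dim: int, q_dim: int, hid_dim: int):
        super().__init__()
        self.d_proj = nn.Linear(obj_dim, hid_dim)  # projects object differences
        self.q_proj = nn.Linear(q_dim, hid_dim)    # projects the question
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, objs: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # objs: (batch, n_objects, obj_dim); q: (batch, q_dim)
        diff = objs.unsqueeze(2) - objs.unsqueeze(1)              # (b, n, n, obj_dim)
        h = torch.relu(self.d_proj(diff) * self.q_proj(q)[:, None, None, :])
        scores = self.score(h).squeeze(-1).sum(dim=2)             # aggregate over partners: (b, n)
        attn = F.softmax(scores, dim=1)                           # one weight per object
        return torch.einsum("bn,bnd->bd", attn, objs)             # attended visual feature

# Toy usage: 36 detected objects per image, as in bottom-up attention features.
oda = ObjectDifferenceAttention(obj_dim=2048, q_dim=1024, hid_dim=512)
print(oda(torch.randn(2, 36, 2048), torch.randn(2, 1024)).shape)  # torch.Size([2, 2048])
```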
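A sketch of the chained reasoning loop, assuming the RR module forms pairwise relation features from concatenated object pairs and the OR module aggregates them back into refined compound objects; the concrete RR/OR operators are not specified in this abstract, so these are assumptions.

```python
import torch
import torch.nn as nn

class ChainOfReasoning(nn.Module):
    """Alternate Relational Reasoning (pairwise relation features) and
    Object Refining (aggregating relations into new compound objects)
    for a fixed number of steps."""
    def __init__(self, dim: int, n_steps: int = 3):
        super().__init__()
        self.n_steps = n_steps
        self.rel = nn.Linear(2 * dim, dim)     # RR: relation from an object pair
        self.refine = nn.Linear(dim, dim)      # OR: relations -> refined object

    def forward(self, objs: torch.Tensor) -> torch.Tensor:
        # objs: (batch, n_objects, dim)
        b, n, d = objs.shape
        for _ in range(self.n_steps):
            left = objs.unsqueeze(2).expand(b, n, n, d)
            right = objs.unsqueeze(1).expand(b, n, n, d)
            relations = torch.relu(self.rel(torch.cat([left, right], dim=-1)))
            objs = torch.relu(self.refine(relations.mean(dim=2)))  # new compound objects
        return objs.mean(dim=1)  # pooled feature from which the answer is predicted

cor = ChainOfReasoning(dim=512)
print(cor(torch.randn(2, 36, 512)).shape)  # torch.Size([2, 512])
```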
A sample-relation based knowledge memory model for VQA is proposed. By retrieving related samples, a contextual knowledge memory is formed that enriches cross-modal information. First, a Knowledge Memory (KM) module is proposed: a BERT classifier is trained to determine whether the answer to a question is implicit in another question. All the questions that may imply answers are then composed into a memory serving a Dynamic REAsoning Machine (DREAM) model, which includes multi-level representation and cross-head inference. DREAM currently ranks first in the GQA Challenge, achieving the best performance on both binary and open questions. A sketch of the memory-construction step follows below.

Finally, a visual question answering demonstration system for research is implemented. The system can compare the dynamic changes of output answers and explanations under different images, different questions, and different models, helping researchers better analyze model behavior.
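For the knowledge-memory module above, here is a sketch of the memory-construction step using the Hugging Face transformers API. The classification head below is untrained and would in practice be fine-tuned on labeled question pairs; the threshold and the "label 1 = implies the answer" convention are assumptions.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Untrained head for illustration; the dissertation's classifier
# would be fine-tuned to judge question-pair implication.
classifier = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
classifier.eval()

def build_memory(query: str, candidates: list[str], threshold: float = 0.5):
    """Keep the candidate questions judged to imply the query's answer."""
    memory = []
    for cand in candidates:
        inputs = tokenizer(query, cand, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = classifier(**inputs).logits.softmax(dim=-1)
        if probs[0, 1].item() > threshold:  # assumed: label 1 = "implies"
            memory.append(cand)
    return memory

print(build_memory("What color is the cat?",
                   ["Is the cat black?", "How many dogs are in the image?"]))
```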
Keywords/Search Tags: Visual Question Answering, Cross-modal Inference, Deep Learning