
Relation-based Visual Question Answering

Posted on: 2021-02-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C F Wu
GTID: 1368330605981258
Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of Internet technology, multimedia data is exploding. In this data, a single medium rarely exists in isolation; different media are naturally symbiotic and semantically related, and are therefore called "cross-media" data. Exploring the semantic associations within cross-media data and improving its semantic understanding and reasoning, and thereby people's ability to manage and leverage such data, is both practically valuable and scientifically challenging. Visual Question Answering (VQA) is a typical cross-media analysis and reasoning task: it takes vision and language, two typical media forms, as input and produces easy-to-evaluate answers as output. Since VQA requires a machine to simultaneously represent and understand vision and language, and to combine both to reason, it has been called a "Visual Turing Machine" and an "AI-complete" problem. This dissertation is based on an in-depth study of the difficulties of the VQA task and an extensive analysis of existing research. The main research results are as follows:

A feature-relation based differential fusion model for VQA is proposed. By mapping both visual and linguistic features into a differential modality, cross-modal information can be better represented. First, a Differential Network (DN) is proposed, which maps the features of each modality into the differential modality. Second, a DN-based Fusion (DF) is proposed to model the feature interaction between the differential modalities (a sketch follows below). Experiments on public datasets show that differential fusion outperforms existing linear and nonlinear fusion methods and reduces the distance between the modalities.

An object-relation based comparative attention model for VQA is proposed. Through pairwise comparison between objects, cross-media information can be better filtered. First, an Object Difference Attention (ODA) is proposed: the differences between objects are obtained through a difference operation, and this difference information is then used to select the visual objects that are useful for answering the question (a sketch follows below). Second, ODA is extended to a more general Comparative Attention (CA), for which four CA kernels are proposed. Experiments on public datasets show that comparative attention outperforms existing non-comparative attention methods, and that different comparison kernels are good at answering different types of questions.

A high-order relation based chained reasoning model for VQA is proposed. By iteratively generating new objects and new relations, cross-media information can better support answer decisions. A Relational Reasoning (RR) module computes compound relations between objects, and an Object Refining (OR) module refines the new relations into new compound objects. Based on these two modules, a Chain of Reasoning (CoR) is proposed: through iterative relational reasoning and object refining, the answer to the question is gradually inferred (a sketch follows below). Experiments on public datasets show that the chain structure outperforms existing parallel and stacked structures, and that the intermediate results of the reasoning are interpretable.
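The abstract does not give the exact form of the differential mapping, so the following is a minimal PyTorch sketch of the idea, assuming the differential operation can be modeled as the difference of two learned linear projections and the fusion as an element-wise product in the shared differential space. All class and parameter names here are illustrative, not the dissertation's.

```python
import torch
import torch.nn as nn

class DifferentialNetwork(nn.Module):
    """Map an input feature into the 'differential' space as the
    difference of two learned linear projections (an assumption)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.left = nn.Linear(in_dim, out_dim)
        self.right = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.left(x) - self.right(x)

class DifferentialFusion(nn.Module):
    """Fuse visual and linguistic features in the shared differential
    space via an element-wise product, then predict an answer."""
    def __init__(self, v_dim: int, q_dim: int, hid_dim: int, n_answers: int):
        super().__init__()
        self.dn_v = DifferentialNetwork(v_dim, hid_dim)
        self.dn_q = DifferentialNetwork(q_dim, hid_dim)
        self.classifier = nn.Linear(hid_dim, n_answers)

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.dn_v(v) * self.dn_q(q))

# Toy usage: a batch of 2 with 2048-d image and 1024-d question features.
model = DifferentialFusion(2048, 1024, 512, 3000)
logits = model(torch.randn(2, 2048), torch.randn(2, 1024))
print(logits.shape)  # torch.Size([2, 3000])
```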
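Likewise, a minimal sketch of object-difference attention, assuming each object's attention score is derived from its pairwise differences with every other object, modulated by the question representation; the projection and aggregation choices below are assumptions, not the dissertation's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectDifferenceAttention(nn.Module):
    """Question-guided attention over detected objects, scored from
    pairwise object differences (a simplified, assumed formulation)."""
    def __init__(self, obj_dim: int, q_dim: int, hid_dim: int):
        super().__init__()
        self.d_proj = nn.Linear(obj_dim, hid_dim)  # projects object differences
        self.q_proj = nn.Linear(q_dim, hid_dim)    # projects the question
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, objs: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # objs: (batch, n_objects, obj_dim); q: (batch, q_dim)
        diff = objs.unsqueeze(2) - objs.unsqueeze(1)              # (b, n, n, obj_dim)
        h = torch.relu(self.d_proj(diff) * self.q_proj(q)[:, None, None, :])
        scores = self.score(h).squeeze(-1).sum(dim=2)             # aggregate over partners: (b, n)
        attn = F.softmax(scores, dim=1)                           # one weight per object
        return torch.einsum("bn,bnd->bd", attn, objs)             # attended visual feature

# Toy usage: 36 detected objects per image, as in bottom-up attention features.
oda = ObjectDifferenceAttention(obj_dim=2048, q_dim=1024, hid_dim=512)
print(oda(torch.randn(2, 36, 2048), torch.randn(2, 1024)).shape)  # torch.Size([2, 2048])
```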
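A sketch of the chained reasoning loop, assuming the RR module forms pairwise relation features from concatenated object pairs and the OR module aggregates them back into refined compound objects; the concrete RR/OR operators are not specified in this abstract, so these are assumptions.

```python
import torch
import torch.nn as nn

class ChainOfReasoning(nn.Module):
    """Alternate Relational Reasoning (pairwise relation features) and
    Object Refining (aggregating relations into new compound objects)
    for a fixed number of steps."""
    def __init__(self, dim: int, n_steps: int = 3):
        super().__init__()
        self.n_steps = n_steps
        self.rel = nn.Linear(2 * dim, dim)     # RR: relation from an object pair
        self.refine = nn.Linear(dim, dim)      # OR: relations -> refined object

    def forward(self, objs: torch.Tensor) -> torch.Tensor:
        # objs: (batch, n_objects, dim)
        b, n, d = objs.shape
        for _ in range(self.n_steps):
            left = objs.unsqueeze(2).expand(b, n, n, d)
            right = objs.unsqueeze(1).expand(b, n, n, d)
            relations = torch.relu(self.rel(torch.cat([left, right], dim=-1)))
            objs = torch.relu(self.refine(relations.mean(dim=2)))  # new compound objects
        return objs.mean(dim=1)  # pooled feature from which the answer is predicted

cor = ChainOfReasoning(dim=512)
print(cor(torch.randn(2, 36, 512)).shape)  # torch.Size([2, 512])
```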
A sample-relation based knowledge memory model for VQA is proposed. By retrieving related samples, a contextual knowledge memory is formed that enriches cross-modal information. First, a Knowledge Memory (KM) module is proposed: a BERT classifier is trained to determine whether the answer to a question is implicit in another question. All the questions that may imply answers are then composed into a memory serving a Dynamic REAsoning Machine (DREAM) model, which includes multi-level representation and cross-head inference. DREAM currently ranks first in the GQA Challenge, achieving the best performance on both binary and open questions. A sketch of the memory-construction step follows below.

Finally, a visual question answering demonstration system for research is implemented. The system can compare the dynamic changes of output answers and explanations under different images, different questions, and different models, helping researchers better analyze model behavior.
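For the knowledge-memory module above, here is a sketch of the memory-construction step using the Hugging Face transformers API. The classification head below is untrained and would in practice be fine-tuned on labeled question pairs; the threshold and the "label 1 = implies the answer" convention are assumptions.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Untrained head for illustration; the dissertation's classifier
# would be fine-tuned to judge question-pair implication.
classifier = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
classifier.eval()

def build_memory(query: str, candidates: list[str], threshold: float = 0.5):
    """Keep the candidate questions judged to imply the query's answer."""
    memory = []
    for cand in candidates:
        inputs = tokenizer(query, cand, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = classifier(**inputs).logits.softmax(dim=-1)
        if probs[0, 1].item() > threshold:  # assumed: label 1 = "implies"
            memory.append(cand)
    return memory

print(build_memory("What color is the cat?",
                   ["Is the cat black?", "How many dogs are in the image?"]))
```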
Keywords/Search Tags: Visual Question Answering, Cross-modal Inference, Deep Learning