
Research And Applications Of Visual Question Answering Technology Based On Sub Question Sequence

Posted on: 2023-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: R N Wang
GTID: 2568306914471694
Subject: Intelligent Science and Technology
Abstract/Summary:
Visual Question Answering (VQA) is the task of predicting a natural-language answer given an image and a natural-language question grounded in that image. Solving it requires processing both linguistic and visual information, which makes it one of the key tasks in multimodal information processing research. Current VQA methods usually try to answer the original question directly, and they do not perform satisfactorily on complex questions. When answering a complex question, people usually decompose it into a series of simple sub-questions and reach the answer to the original question by answering that sub-question sequence step by step. This thesis studies VQA by simulating this process. The main contributions are as follows.

A conversation-based Visual Question Answering (Co-VQA) framework built on a sub-question sequence (SQS) is proposed. The framework consists of three parts: a sub-question generator (Questioner), a sub-question responder (Oracle), and a visual question responder that integrates the sub-question sequence with the original question (Answerer). The Questioner raises sub-questions, the Oracle answers them one by one, and the Answerer finally gives the answer based on the results of this interaction. This thesis designs a model for each of the three parts. Since sub-question generation is a generation task, the Questioner is modeled on an extended hierarchical encoder-decoder model. The Oracle is modeled on the classical co-attention model. Because the Answerer must take into account the original question, the image, and the complete history of sub-questions and answers, this thesis designs an Adaptive Chain Visual Reasoning Model (ACVRM) that performs explicit reasoning over the SQS; in ACVRM, each sub-question guides the update of the visual features through a graph attention network.

To provide supervision for training each model, a general semi-automatic SQS construction technique is proposed. It has four stages: key information extraction, template-based sub-question construction, answer judgment based on image information, and manual correction. Together these stages decompose a high-order complex question into a sequence of low-order sub-questions. The technique is applied to two widely used VQA datasets, VQA 2.0 and VQA-CP v2, and an SQS is constructed for every question in both datasets.

On this basis, the proposed Co-VQA method is evaluated experimentally. The results show that on VQA 2.0, Co-VQA achieves performance comparable to the best model that does not use pre-training, and on VQA-CP v2 it achieves state-of-the-art performance. In addition, extensive experimental analyses show that Co-VQA offers better interpretability and error traceability than existing models.

Finally, a web-based interactive VQA demonstration system is built on the proposed Co-VQA method. Through attention visualization and visualization of the internal dialogue, the system provides users with interpretable VQA services. To collect high-quality new SQS samples, users can also take part in the internal question-answering interaction as the sub-question generator or responder, which further strengthens human-computer interaction.
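The Questioner/Oracle/Answerer interaction can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the thesis implementation: the three components are stand-in callables (in the thesis they are a hierarchical encoder-decoder, a co-attention model, and ACVRM), and the stopping rule (the Questioner returns `None`, or a hypothetical `max_turns` cap is reached) is assumed, not taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Image = object
History = List[Tuple[str, str]]  # (sub question, sub answer) pairs, i.e. the SQS so far


@dataclass
class CoVQA:
    # Stand-ins for the three trained models; signatures are hypothetical.
    questioner: Callable[[Image, str, History], Optional[str]]  # None = stop (assumed)
    oracle: Callable[[Image, str], str]
    answerer: Callable[[Image, str, History], str]
    max_turns: int = 5  # hypothetical cap on SQS length

    def answer(self, image: Image, question: str) -> Tuple[str, History]:
        history: History = []
        for _ in range(self.max_turns):
            sub_q = self.questioner(image, question, history)  # raise next sub question
            if sub_q is None:
                break
            sub_a = self.oracle(image, sub_q)  # answer it from the image
            history.append((sub_q, sub_a))
        # The Answerer sees the original question, the image, and the full SQS.
        return self.answerer(image, question, history), history


# Toy usage: the "image" is a dict of facts and the models are scripted stubs.
scene = {"what is on the table": "cup", "what color is the cup": "red"}
script = ["what is on the table", "what color is the cup"]


def toy_questioner(image, question, history):
    return script[len(history)] if len(history) < len(script) else None


def toy_oracle(image, sub_q):
    return image.get(sub_q, "unknown")


def toy_answerer(image, question, history):
    return history[-1][1] if history else "unknown"


model = CoVQA(toy_questioner, toy_oracle, toy_answerer)
answer, sqs = model.answer(scene, "what color is the object on the table")
print(answer)  # red
```

Keeping the SQS as an explicit list of (sub-question, answer) pairs is what makes the interaction inspectable: a wrong final answer can be traced to the turn where a sub-answer went wrong.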
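ACVRM's idea of letting a sub-question steer a graph attention update of the visual features can be illustrated with one question-conditioned graph attention step over image regions. Every detail here is an assumption rather than the thesis's design: a fully connected region graph, elementwise gating of region features by the sub-question embedding, a fixed vector `a` standing in for learned attention parameters, no learned projection matrix, and a residual update. The scoring follows the standard graph attention form (LeakyReLU of a linear score over concatenated node pairs, then softmax over neighbors).

```python
import math
from typing import List

Vec = List[float]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def softmax(xs: Vec) -> Vec:
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def question_guided_gat_step(regions: List[Vec], q: Vec, a: Vec) -> List[Vec]:
    """One sub-question-conditioned update of region features (illustrative).

    regions: N region features of dimension d (nodes of a fully connected graph).
    q: sub-question embedding, assumed already projected to dimension d.
    a: attention vector of dimension 2d (placeholder for learned weights).
    """
    d = len(q)
    # Condition every node on the sub question by elementwise gating (assumed).
    gated = [[h[k] * q[k] for k in range(d)] for h in regions]
    updated = []
    for i, hi in enumerate(gated):
        # e_ij = LeakyReLU(a . [g_i || g_j]) for every neighbor j (self included).
        scores = [
            leaky_relu(sum(a[k] * hi[k] for k in range(d))
                       + sum(a[d + k] * hj[k] for k in range(d)))
            for hj in gated
        ]
        alpha = softmax(scores)
        msg = [sum(alpha[j] * gated[j][k] for j in range(len(gated))) for k in range(d)]
        updated.append([regions[i][k] + msg[k] for k in range(d)])  # residual update
    return updated


regions = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 0.0]            # toy sub question that "cares about" feature 0
a = [0.0, 0.0, 1.0, 0.0]  # toy attention weights
out = question_guided_gat_step(regions, q, a)
print(out)
```

In this toy run both regions attend more strongly to region 0, the one whose features match the sub-question, so repeating the step once per sub-question yields a chain of progressively refined visual features.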
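The four-stage semi-automatic SQS construction can be outlined as a small pipeline. Everything concrete here is hypothetical: the two question types, the templates in `TEMPLATES`, the keyword-based extractor, and the annotation dictionary used for answer judgment are placeholders for illustration. Stage four (manual correction) is reduced to a flag that routes a question to a human annotator.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical templates; the thesis's actual template set is not reproduced here.
TEMPLATES: Dict[str, List[str]] = {
    "color": ["Is there a {obj} in the image?", "What color is the {obj}?"],
    "count": ["Is there a {obj} in the image?", "How many {obj}s are there?"],
}

Annotations = Dict[str, Dict[str, object]]  # object name -> attributes (assumed format)


def extract_key_info(question: str) -> Optional[Tuple[str, str]]:
    """Stage 1 (toy): extract the question type and the target object."""
    words = question.lower().rstrip("?").split()
    if words[:2] == ["what", "color"]:
        return "color", words[-1]
    if words[:2] == ["how", "many"] and len(words) > 2:
        return "count", words[2].rstrip("s")
    return None


def judge_answer(sub_q: str, qtype: str, obj: str, ann: Annotations) -> Optional[str]:
    """Stage 3 (toy): answer a sub question from image annotations."""
    if sub_q.startswith("Is there"):
        return "yes" if obj in ann else "no"
    value = ann.get(obj, {}).get(qtype)
    return None if value is None else str(value)


def build_sqs(question: str, ann: Annotations) -> Tuple[List[Tuple[str, Optional[str]]], bool]:
    """Return (sub-question sequence, needs_manual_correction)."""
    info = extract_key_info(question)
    if info is None:
        return [], True  # Stage 4: no template applies, send to an annotator
    qtype, obj = info
    sqs = []
    for tpl in TEMPLATES[qtype]:
        sub_q = tpl.format(obj=obj)  # Stage 2: instantiate the template
        sqs.append((sub_q, judge_answer(sub_q, qtype, obj, ann)))  # Stage 3
    needs_review = any(ans is None for _, ans in sqs)  # Stage 4 trigger
    return sqs, needs_review


print(build_sqs("What color is the cup?", {"cup": {"color": "red"}}))
```

Routing only unanswerable or unmatched cases to a human is what makes the process semi-automatic: templates and image annotations cover the common cases, and manual effort is spent where automation fails.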
Keywords/Search Tags: Visual Question Answering, Sub Question Sequence, Co-VQA, Chain Visual Reasoning