
Research On Visual Question Answering Based On Text Semantic Understanding

Posted on: 2020-08-06
Degree: Master
Type: Thesis
Country: China
Candidate: M F Li
GTID: 2428330578979407
Subject: Software engineering
Abstract/Summary:
Visual Question Answering (VQA) is an active research topic of recent years at the intersection of natural language processing and computer vision. Starting from the semantic understanding of the question text, this thesis studies the cross-modal fusion of question text and image along three directions: a text-guided dual-branch attention network, unbalanced pooling with question text splitting, and completely unbalanced pooling with multi-encoding question text splitting, with the goal of generating more accurate open-ended answers. The main work of this thesis is as follows:

1) Existing text-guided VQA models suffer from insufficient semantic representation of the question text and insufficient information sharing between modalities. To address this, this thesis studies the fast extraction of semantic features from question texts and the mining of differences and complementarities between modalities, and proposes a VQA method based on a text-guided dual-branch attention network (TDAN). Its main highlights are: i) a 1D-GCNN encodes in parallel the joint embedding vector composed of GloVe features and word positions to obtain the text semantic representation; ii) a dual-branch network, consisting of a multi-modal cross-guided co-attention network and a multi-modal factorized high-order co-attention network, mines the differences and complementarities between the modalities and extracts a joint feature from each branch; iii) text-guided weight learning automatically generates weights for the branch joint features, which are then weighted and summed to form the final joint feature (sketched below). Experiments on the VQA 2.0 and COCO-QA datasets demonstrate the effectiveness of TDAN, which rests on question text semantic representation and multi-feature fusion.

2) Existing VQA methods understand the question text as an undivided whole, so the understanding of image content lacks a clear target. To address this, this thesis studies question text splitting and unbalanced pooling, and proposes a VQA method based on unbalanced pooling with question text splitting. Its main highlights are: i) the text semantic features are split into a question header and a question footer, where the footer carries object information and the header carries question-type information; ii) a feature reinforcement layer is designed and stacked on the multi-modal factorized bilinear pooling (MFB) model to construct an unbalanced pooling model that yields unbalanced joint features (sketched below). Experiments on the VQA 2.0 and COCO-QA datasets show that, during answer generation, the proposed method makes full use of the semantic information of the different parts of the question text to focus on the important image content, verifying the rationality of question text splitting and unbalanced pooling.
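The text-guided weight learning in 1) can be illustrated with a minimal PyTorch sketch. The two attention branches themselves are omitted; the feature dimensions, the two-layer weight network, and the softmax over branch weights are illustrative assumptions, not the thesis's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedFusion(nn.Module):
    """Combines the joint features of the two attention branches with
    weights predicted from the question feature (a sketch; the weight
    network's shape and all dimensions are assumptions)."""
    def __init__(self, q_dim=1024, n_branches=2):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Linear(q_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_branches),
        )

    def forward(self, q, z_branches):
        # q: (B, q_dim) question feature; z_branches: list of (B, z_dim)
        w = F.softmax(self.weight_net(q), dim=-1)   # (B, n_branches) weights
        z = torch.stack(z_branches, dim=1)          # (B, n_branches, z_dim)
        return (w.unsqueeze(-1) * z).sum(dim=1)     # weighted sum -> (B, z_dim)

# usage with random stand-ins for the two co-attention branch outputs
fusion = TextGuidedFusion()
q = torch.randn(8, 1024)
z_co, z_high = torch.randn(8, 2048), torch.randn(8, 2048)
joint = fusion(q, [z_co, z_high])                   # final joint feature
```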
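The unbalanced pooling in 2) builds on MFB-style factorized bilinear pooling. Below is a minimal sketch of one plausible reading, in which the question footer enters the bilinear term with the image and stacked feature reinforcement layers re-multiply projections of the question header; the exact roles of header and footer, the number of layers, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnbalancedMFB(nn.Module):
    """MFB-style pooling plus stacked 'feature reinforcement' layers that
    re-multiply projections of the question, making the question side of
    the joint feature higher-order than the image side (a sketch)."""
    def __init__(self, q_dim=1024, v_dim=2048, k=5, o=1000, n_reinforce=2):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, k * o)   # question footer (object info)
        self.v_proj = nn.Linear(v_dim, k * o)   # image feature
        # stacked reinforcement layers driven by the question header
        self.reinforce = nn.ModuleList(
            nn.Linear(q_dim, k * o) for _ in range(n_reinforce))
        self.k, self.o = k, o

    def forward(self, q_head, q_foot, v):
        z = self.q_proj(q_foot) * self.v_proj(v)    # balanced bilinear term
        for layer in self.reinforce:                # each stacked layer raises
            z = z * layer(q_head)                   # the question order by one
        z = z.view(-1, self.o, self.k).sum(dim=2)   # MFB sum pooling over k
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)  # power norm
        return F.normalize(z, dim=-1)               # l2 normalization
```

Each reinforcement layer multiplies in one more question projection, which is what makes the pooling unbalanced; adding depth this way, however, is exactly the complexity that contribution 3) removes.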
3) The unbalanced multi-modal pooling module above still contains a partially balanced pooling component, so multiple feature reinforcement layers must be stacked, which greatly increases model complexity. To address this, this thesis studies a multi-encoding semantic extraction module and a completely unbalanced pooling module, and proposes a VQA method based on completely unbalanced pooling with multi-encoding question text splitting. Its main highlights are: i) a Bi-GRU and a 1D-GCNN encode the question embedding vector, which consists of a pre-trained fixed vector and a randomly initialized trainable vector, to form a joint encoding vector; ii) a feature reinforcement exponent optimizes the original extended joint feature, replacing the stacked linear feature reinforcement structure of the unbalanced pooling module with a completely unbalanced pooling module (sketched below). Experiments on the VQA 2.0 and COCO-QA datasets show that the proposed method makes full use of different word embeddings and encoding schemes to mine the semantics of question texts, improving the accuracy of answer prediction while greatly reducing the complexity of the method.
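The completely unbalanced pooling in 3) replaces the stacked reinforcement layers with a single feature reinforcement exponent. The sketch below applies a learned signed power to the question projection; the signed-power formulation, the exponent's initial value, and the dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompletelyUnbalancedPool(nn.Module):
    """Tunes the degree of unbalance with one learned exponent on the
    question projection instead of extra linear layers (a sketch)."""
    def __init__(self, q_dim=1024, v_dim=2048, k=5, o=1000):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, k * o)
        self.v_proj = nn.Linear(v_dim, k * o)
        self.p = nn.Parameter(torch.tensor(2.0))    # feature reinforcement exponent
        self.k, self.o = k, o

    def forward(self, q, v):
        qz = self.q_proj(q)
        # signed power: keep the sign, raise the magnitude to the exponent p
        qz = torch.sign(qz) * torch.abs(qz).clamp_min(1e-8).pow(self.p)
        z = qz * self.v_proj(v)                     # unbalanced bilinear term
        z = z.view(-1, self.o, self.k).sum(dim=2)   # MFB-style sum pooling
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)
        return F.normalize(z, dim=-1)
```

Compared with the stacked-layer version above, the order of the question side is controlled by the single parameter p, which is consistent with the abstract's claim that the method's complexity is greatly reduced.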
Keywords/Search Tags:visual question answering, question splitting, co-attention mechanism, text attention mechanism, multi-modal pooling