
Research On Visual Question Answering Based On Deep Neural Network

Posted on: 2022-11-12
Degree: Master
Type: Thesis
Country: China
Candidate: P R Zou
Full Text: PDF
GTID: 2518306776497054
Subject: Computer Software and Computer Application

Abstract/Summary:
In recent years, with the rapid development of computer vision and natural language processing technology, visual question answering (VQA) has come to play a crucial role in fields such as intelligent education, visual impairment assistance, and bionic robots. As a typical multi-modal task, visual question answering aims to combine vision, language, and high-level reasoning to automatically answer natural-language questions about images. The VQA task is also a test of machine intelligence and a benchmark for general artificial intelligence, and it has great application value and prospects. The main work of this paper is as follows:

(1) Most existing VQA models ignore the dynamic relationships of semantic information between the two modalities and the rich spatial structure of the image. For this reason, a Multi-Module Co-Attention Network (MMCAN) for visual question answering is proposed, which captures both the dynamic interactions between objects in the visual scene and the contextual representation of the question text. Firstly, we model the relationships between different types of objects through a graph attention mechanism to learn an adaptive, question-guided relational representation. Secondly, we use the text features together with the visual relationships and their attributes to strengthen the correlation between word embeddings and the corresponding image regions through co-attention encoding. Finally, the model's fitting ability is improved through an attention enhancement module. We implement the MMCAN algorithm and evaluate it on the VQA 2.0 and VQA-CP v2 datasets: on the test-dev and test-standard splits of VQA 2.0, the accuracies for the "overall", "yes/no", "count", and "other" question categories are 68.47%, 84.93%, 49.57%, and 58.68% (test-dev) and 68.85%, 85.28%, 49.76%, and 58.84% (test-standard), respectively. On VQA-CP v2, the accuracies for the four question types are 40.36%, 42.42%, 12.97%, and 46.67%, respectively. The experiments show that the proposed model is significantly more accurate than the DA-NTN, ReGAT, and ODA-GCN algorithms and can effectively improve the accuracy of visual question answering.

(2) By further analyzing the complex scenes in images, we find that most VQA models cannot capture deeper relational semantics and lack interpretability. To this end, a Scene Relational Network (SRN) is proposed. It explicitly integrates the semantic and spatial relationships of the scene and utilizes the relationships between visual scenes and their attributes to support VQA reasoning. Firstly, we construct a scene graph network based on the visual object relationships detected in the image. Secondly, we encode pre-defined scene semantic relationships and spatial object relationships through an adaptive question-graph attention mechanism to learn multi-modal feature representations. Finally, the two relational models are linearly fused to infer the answer. We implement the SRN algorithm and evaluate it on the VQA 2.0 dataset: on the test-dev and test-standard splits, the accuracies for the "overall", "yes/no", "count", and "other" question categories are 69.44%, 85.69%, 49.51%, and 58.73% (test-dev) and 69.92%, 86.11%, 50.14%, and 59.57% (test-standard), respectively. The results show that the SRN model can carry out visual relation reasoning under the guidance of the question, which is particularly effective for natural-language questions with complex structure.
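To make the relation-modelling step in contribution (1) concrete, the sketch below shows one plausible form of a question-guided graph attention layer over detected object features. It is not the thesis implementation; the class name, layer choices, and dimensions (obj_dim, q_dim, hid_dim) are illustrative assumptions.

```python
# Minimal sketch (assumed, not the thesis code): question-guided graph
# attention over detected object regions. Each object pair (i, j) is scored
# conditioned on the question, and neighbour features are aggregated into a
# relation-aware object representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedGraphAttention(nn.Module):
    def __init__(self, obj_dim=2048, q_dim=1024, hid_dim=512):  # dims are assumptions
        super().__init__()
        self.proj_obj = nn.Linear(obj_dim, hid_dim)  # project region features
        self.proj_q = nn.Linear(q_dim, hid_dim)      # project question feature
        self.att = nn.Linear(hid_dim, 1)             # pairwise relation score

    def forward(self, objs, q):
        # objs: (B, K, obj_dim) detected region features; q: (B, q_dim) question vector
        h = self.proj_obj(objs)                      # (B, K, H)
        qh = self.proj_q(q).unsqueeze(1)             # (B, 1, H)
        # score every object pair (i, j), conditioned on the question
        pair = h.unsqueeze(2) + h.unsqueeze(1) + qh.unsqueeze(1)  # (B, K, K, H)
        logits = self.att(torch.tanh(pair)).squeeze(-1)           # (B, K, K)
        alpha = F.softmax(logits, dim=-1)            # attention over neighbours
        return alpha @ h                             # (B, K, H) relation-aware features
```

In a full model, the relation-aware object features produced by such a layer would then be co-attended with the word embeddings of the question, as described above.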
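Likewise, the final step of contribution (2), linearly fusing the semantic-relation and spatial-relation branches before predicting the answer, could look roughly as follows. The fusion weight lam, the answer-vocabulary size, and all dimensions are assumptions rather than the settings used in the thesis.

```python
# Minimal sketch (assumed, not the thesis code): linear fusion of the pooled
# outputs of the semantic-relation and spatial-relation branches, followed by
# a classifier over the answer vocabulary.
import torch.nn as nn

class RelationFusionHead(nn.Module):
    def __init__(self, feat_dim=512, num_answers=3129, lam=0.5):  # values are assumptions
        super().__init__()
        self.lam = lam                                # fusion weight
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_answers))         # answer logits

    def forward(self, sem_feat, spa_feat):
        # sem_feat / spa_feat: (B, feat_dim) pooled outputs of the two branches
        fused = self.lam * sem_feat + (1.0 - self.lam) * spa_feat
        return self.classifier(fused)                 # (B, num_answers)
```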
Keywords/Search Tags: visual question answering, attention mechanism, relational reasoning, multimodal learning, feature fusion