
Research On Visual Question Answering System Based On Image Attention

Posted on: 2022-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Liao
Full Text: PDF
GTID: 2518306575464774
Subject: Control Science and Engineering
Abstract/Summary:
The Visual Question Answering (VQA) system is a task that takes an image and a question as input, and the computer combines the input image information and text information to generate an answer in human language as output. It involves the two fields of computer vision and natural language processing. In recent years, the key to visual question answering has mainly lain in how to exploit effective image feature information and how to achieve information interaction between image features and question features. Aiming at these two problems, this thesis proposes an Image Information Enhanced Network (IEN) and a Cross-Guided Attention Network (CGAN) for visual question answering. The main contents include:

1. In view of the problem that most current VQA methods mainly use the global features of the image and ignore effective image semantic information, and therefore fail to understand the image at a fine-grained level, the IEN VQA model is proposed for image information enhancement. The IEN model extracts image features through a Faster Region-based Convolutional Neural Network (Faster R-CNN) and a Long Short-Term Memory (LSTM) network to obtain enhanced image information features, and then introduces a spatial-domain image attention mechanism to obtain weighted feature vectors. Meanwhile, the IEN model uses the GloVe word embedding model and an LSTM network to extract the features of the text question. On this basis, the IEN model fuses the question features with the weighted image features, and finally feeds the fused features to a softmax classifier to predict the answer.

2. Aiming at the problem that most VQA models introduce an attention mechanism only for images or only for texts, without fully exploring the spatial and logical relations between the two, the CGAN VQA model based on a cross-guided attention mechanism is proposed. It forms a cross-guided attention strategy by applying attention mechanisms to both images and texts, so that image features and text features achieve sufficient information interaction, thereby greatly improving the performance of the model.

3. This thesis conducts experiments and verification on the typical public data sets VQA 2.0 and COCO-QA. Experimental results show that the overall accuracy of the IEN model and the CGAN model on the VQA 2.0 data set reaches 66.4% and 67.43%, respectively, and the accuracy of the CGAN model on the COCO-QA data set is 63.7%. The overall performance of the proposed models is better than that of other mainstream visual question answering models.

4. Finally, this thesis designs and implements an application system for visual question answering. We propose a framework for the VQA application system and use the Python-based UI development kit PyQt5 to design and develop it. The implemented VQA system was tested on the VQA 2.0 data set. According to the test results, the proposed VQA models can handle general VQA questions and have good practicability.
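The IEN pipeline described in point 1 — attention weights over region features, fusion with a question vector, then a softmax classifier — can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the thesis's actual implementation: all class names, layer shapes, and dimensions (e.g. 2048-d Faster R-CNN regions, 512-d LSTM question states) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionFusion(nn.Module):
    """Sketch of an IEN-style head: weight image region features by their
    relevance to the question, fuse with the question vector, and score
    candidate answers with a softmax-style classifier (logits here)."""

    def __init__(self, img_dim=2048, q_dim=512, hidden=512, n_answers=3000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # project region features
        self.q_proj = nn.Linear(q_dim, hidden)       # project question vector
        self.att = nn.Linear(hidden, 1)              # scalar score per region
        self.fuse = nn.Linear(img_dim + q_dim, hidden)
        self.classifier = nn.Linear(hidden, n_answers)

    def forward(self, img_feats, q_feat):
        # img_feats: (B, K, img_dim) region features (e.g. from Faster R-CNN)
        # q_feat:    (B, q_dim) question vector (e.g. last LSTM hidden state)
        joint = torch.tanh(self.img_proj(img_feats)
                           + self.q_proj(q_feat).unsqueeze(1))
        alpha = F.softmax(self.att(joint), dim=1)    # (B, K, 1) region weights
        v = (alpha * img_feats).sum(dim=1)           # (B, img_dim) weighted image
        h = torch.relu(self.fuse(torch.cat([v, q_feat], dim=-1)))
        return self.classifier(h)                    # (B, n_answers) answer logits
```

In training, the logits would be passed through softmax (or cross-entropy loss) over the answer vocabulary, matching the classifier step described above.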
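The cross-guided attention of point 2 — each modality attending over the other so both interact — can be sketched as a simple co-attention over a region-word affinity matrix. Again this is an illustrative assumption, not the CGAN model itself; the function name and the dot-product affinity are placeholders for whatever scoring the thesis uses.

```python
import torch
import torch.nn.functional as F

def cross_guided_attention(img_feats, txt_feats):
    """Co-attention sketch: text-guided image features and image-guided
    text features from one shared affinity matrix."""
    # img_feats: (B, K, d) region features; txt_feats: (B, T, d) word features
    scores = torch.bmm(img_feats, txt_feats.transpose(1, 2))  # (B, K, T) affinity
    # each word attends over the K image regions (normalize over regions)
    txt_guided = torch.bmm(F.softmax(scores, dim=1).transpose(1, 2), img_feats)
    # each region attends over the T question words (normalize over words)
    img_guided = torch.bmm(F.softmax(scores, dim=2), txt_feats)
    return img_guided, txt_guided   # (B, K, d), (B, T, d)
```

The two outputs carry information from the opposite modality, so a downstream fusion layer sees image features already conditioned on the question and vice versa, which is the "sufficient information interaction" the cross-guided strategy aims at.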
Keywords/Search Tags: visual question answering, image feature, text feature, feature fusion, cross-guided attention mechanism