Research On Visual Question Answering Based On Multiple Attention Mechanism And Feature Fusion Algorithm

Posted on: 2021-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: S T Zhou
GTID: 2428330614458480
Subject: Control Science and Engineering
Abstract/Summary:
Visual question answering is a frontier research direction that combines computer vision and natural language processing. Given an image and a question about it, a visual question answering system uses the semantics of the question to find the relevant information in the image and predict the answer. A visual question answering model consists of four modules: image feature processing, text feature processing, multi-modal feature fusion and answer prediction. The first two belong to the category of feature extraction. Feature extraction, multi-modal feature fusion and the improvement of attention mechanisms remain difficult problems in current visual question answering research, so this thesis studies these three problems:

1. Image preprocessing model based on the Faster R-CNN target detection algorithm. This thesis combines Faster R-CNN and ResNet-101 to process image information. Faster R-CNN identifies object instances of the target classes and localizes them with bounding boxes. The ResNet-101 model preprocesses the VQA v2 data set and extracts 2048-dimensional image feature vectors, which are stored as matrix files and used to train the visual question answering models.

2. Research on a visual question answering model based on multi-modal feature fusion. To address cross-modal feature fusion, and building on the work in 1, this thesis uses pre-trained word vectors and a long short-term memory (LSTM) network to encode the text, producing a 2048-dimensional feature vector that represents the question. The 2048-dimensional image feature vector and the 2048-dimensional question feature vector are then fed into the multi-modal factorized bilinear pooling (MFB) feature fusion module to generate the fused features. Finally, the answer prediction module uses Softmax as the classifier to output the predicted answer. Experimental results on the VQA v2 data set indicate that the design of the visual question answering model constructed in this thesis is sound.

3. Research on a visual question answering model based on multiple attention mechanisms and multi-modal feature fusion. To strengthen the semantic information available to the model and capture more accurate image features, this thesis adds a self-attention mechanism, a guided attention mechanism and a multi-head attention mechanism to the model from 2, forming a visual question answering model based on multiple attention mechanisms. The aim is to better capture the semantic relations between images and text and to narrow the gap between modalities during feature fusion. The experimental results show that the visual question answering model combining multiple attention mechanisms with the multi-modal factorized bilinear pooling feature fusion algorithm achieves higher accuracy than the advanced models it is compared with.
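The fusion and answer prediction steps described in the abstract can be illustrated with a minimal NumPy sketch. It assumes that the fusion algorithm named in the abstract is multi-modal factorized bilinear pooling (MFB): both feature vectors are projected, multiplied element-wise, sum-pooled over windows of size k, then power- and L2-normalized. The function names, the factor count `k`, the output size `o`, the number of answers and the random weights are all hypothetical choices for illustration; they are not taken from the thesis.

```python
import numpy as np

def mfb_fuse(img_feat, q_feat, U, V, k):
    """Fuse two feature vectors with MFB-style pooling (illustrative sketch).

    img_feat: (d_img,) image feature vector
    q_feat:   (d_q,)   question feature vector
    U:        (d_img, k * o) projection for the image features
    V:        (d_q,   k * o) projection for the question features
    k:        number of factors summed per output dimension
    """
    # Project both modalities and combine them element-wise.
    joint = (img_feat @ U) * (q_feat @ V)            # shape (k * o,)
    # Sum-pool over non-overlapping windows of size k.
    pooled = joint.reshape(-1, k).sum(axis=1)        # shape (o,)
    # Power normalization (signed square root), then L2 normalization.
    pooled = np.sign(pooled) * np.sqrt(np.abs(pooled))
    return pooled / (np.linalg.norm(pooled) + 1e-12)

def softmax(logits):
    """Numerically stable softmax over a 1-D vector of logits."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Hypothetical sizes: 2048-d inputs as in the abstract; k, o, n_answers are assumed.
rng = np.random.default_rng(0)
d, k, o, n_answers = 2048, 5, 16, 10
img_feat = rng.standard_normal(d)                    # stand-in for a ResNet-101 feature
q_feat = rng.standard_normal(d)                      # stand-in for an LSTM question feature
U = rng.standard_normal((d, k * o)) * 0.01           # random, untrained weights
V = rng.standard_normal((d, k * o)) * 0.01
W_cls = rng.standard_normal((o, n_answers))          # untrained classifier weights

fused = mfb_fuse(img_feat, q_feat, U, V, k)
probs = softmax(fused @ W_cls)                       # distribution over candidate answers
predicted_answer = int(np.argmax(probs))
```

In a trained model, `U`, `V` and `W_cls` would be learned jointly, and the attention mechanisms described in part 3 would reweight the image and question features before they reach the fusion step.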
Keywords: visual question answering, target detection algorithm, multi-modal feature fusion, multiple attention mechanism