Research And Algorithm Implementation Of Efficient Visual Question Answering Based On Deep Learning

Posted on: 2021-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: B Q Lin
Full Text: PDF
GTID: 2428330623968156
Subject: Software engineering
Abstract/Summary:
Visual question answering (VQA) combines computer vision and natural language processing. The model must reason over the information in an image and a question and produce an answer. This thesis designs a layered attention mechanism for image feature extraction in VQA models. The mechanism consists of two sub-attention mechanisms with different structures. The first-level attention uses an object detection network as its backbone. Object detection is a computer vision technique that detects semantic objects in images and videos. This first level takes the original image as input and outputs the features of the objects in the image. The second-level attention takes the output of the first level, together with the question feature extracted by a recurrent neural network, as input and outputs a question-directed feature. This image feature represents the current task better because background information is filtered out and question information is introduced. The test results show that layered attention yields a larger improvement in accuracy on counting questions and also helps with answering other kinds of questions.

This thesis also improves the Multi-modal Factorized Bilinear Pooling (MFB) proposed by Zhou Yu et al. The improved feature fusion module is implemented with fully convolutional layers and a global pooling layer, which removes the restriction on input dimension so that the module can receive multi-dimensional inputs. In addition, a nonlinear activation layer is used to improve the nonlinear expressive ability of the module.

As the number of layers in deep learning models gradually increases, so does their demand for hardware resources. Especially in the training stage, a model consumes a large amount of memory, bandwidth, disk, and computing resources, which makes it difficult for deep learning research results to leave the laboratory. To reduce the resource demand of the VQA model, we use a kernel pruning algorithm that prunes unimportant convolutional kernels according to a threshold. Pruning can reduce memory occupation, feedforward propagation delay, and other metrics on the premise that the drop in accuracy stays within an acceptable range.

In the test stage, we measure the contribution of each module to performance and compare our model with domestic and international research results. The test results show that our model greatly improves the accuracy of answering counting questions, which is about 5% higher than the existing model, and that the accuracy of answering other types of questions also improves. The total test accuracy exceeds that of the existing model. Our compressed model reduces occupied memory by 13% and feedforward delay by 16% while the accuracy decreases by only 0.8%, which demonstrates the feasibility of compressing a visual question answering model.
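
The threshold-based kernel pruning mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how kernel importance is measured, so the L1 norm used here is an assumption, and the names `l1_norm` and `prune_kernels` are hypothetical. The example uses plain Python lists rather than a deep learning framework to stay self-contained; it is not the thesis implementation.

```python
from typing import List, Sequence, Tuple

# One convolutional kernel (filter): [in_channels][height][width]
Kernel = List[List[List[float]]]


def l1_norm(kernel: Kernel) -> float:
    """Sum of absolute weights in one kernel (assumed importance score)."""
    return sum(abs(w) for channel in kernel for row in channel for w in row)


def prune_kernels(
    kernels: Sequence[Kernel], threshold: float
) -> Tuple[List[Kernel], List[int]]:
    """Keep only kernels whose L1 norm is at least `threshold`.

    Returns the surviving kernels and their original indices, so the
    matching input channels of the next layer can be removed as well.
    """
    kept_indices = [i for i, k in enumerate(kernels) if l1_norm(k) >= threshold]
    return [kernels[i] for i in kept_indices], kept_indices


# Toy layer: three 1x2x2 kernels with L1 norms of about 4.0, 0.1, and 2.0.
layer = [
    [[[1.0, -1.0], [1.0, -1.0]]],
    [[[0.025, 0.025], [-0.025, 0.025]]],
    [[[0.5, 0.5], [-0.5, 0.5]]],
]
pruned, kept = prune_kernels(layer, threshold=1.0)
print(kept)  # [0, 2]
```

In practice the threshold would be tuned against a validation set so that the accuracy drop stays within the acceptable range, and the pruned network is typically fine-tuned afterwards.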
Keywords/Search Tags: visual question answering, attention mechanism, multi-modal feature fusion, neural network, deep learning