Research On Deep Learning Based Visual Question Answering

Posted on: 2019-03-13
Degree: Master
Type: Thesis
Country: China
Candidate: X Wang
GTID: 2428330548458931
Subject: Computer application technology
Abstract/Summary:
Methods based on convolutional neural networks have made major breakthroughs in computer vision, while advances in neural network language models and recurrent neural networks are driving the development of natural language processing. With improvements in object detection and neural machine translation, more and more researchers are turning to visual question answering. Visual question answering differs from traditional question answering in that, besides understanding the question, the system must generate answers according to the content of an image. Training datasets are a prerequisite for studying visual question answering. Existing experimental datasets include DAQUAR, COCO-QA, Visual Genome, FVQA, and VQA 1.0. The trend across these datasets is toward more images, richer questions, and more precise answers. Visual question answering methods can be classified as traditional machine learning methods, joint embedding methods, attention-based methods, and models with an external database.

After a brief review of current research on visual question answering, we first introduce the basic neural network, the convolutional neural network, the recurrent neural network, and a modified version of the latter, the long short-term memory (LSTM) network. We then introduce the attention mechanism and its applications. We found that different convolutional neural networks tend to extract image features at different levels, so we use ResNet to extract global image features and Mask R-CNN to extract local image features. In addition, an attention mechanism is used to integrate the image features with the question encoding, and stacked attention layers are used to improve the coupling between the image and question encodings. We propose two algorithms for visual question answering: Object Feature based Visual Question Answering and Dual View with Stacked Attention Visual Question Answering. For both algorithms, the model architecture, the image feature extraction, the question encoding, and the use of the attention mechanism are presented in detail.

The models were implemented in PyTorch and trained on the VQA 1.0 dataset, using GPUs to reduce training time. Based on the proposed architecture, we investigate the impact of l2 normalization, dropout layers, the size of the GRU hidden layer, and the number of attention layers under different configurations. The experimental results show that applying l2 normalization, adding a dropout layer, increasing the size of the GRU hidden layer, and using two attention layers can increase overall accuracy. The results on the test set show that the proposed methods can extract image features at different levels, understand the meaning of the question, and give an appropriate answer according to the image and the question. The accuracy of the model also improved compared with several existing models. Finally, we select several examples and discuss the answers given by our algorithms.
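The idea of stacked attention over l2-normalized image features can be illustrated with a minimal, library-free sketch. This is not the thesis implementation (which was built in PyTorch with learned layers): it assumes plain dot-product scoring with no trainable weights, assumes the question vector and region features share the same dimension, and uses the hypothetical names `l2_normalize`, `attend`, and `stacked_attention`.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, regions):
    """One attention layer: score each region against the query by dot product,
    then return the attention-weighted sum of regions and the weights."""
    scores = [sum(q * r for q, r in zip(query, region)) for region in regions]
    weights = softmax(scores)
    context = [
        sum(w * region[i] for w, region in zip(weights, regions))
        for i in range(len(query))
    ]
    return context, weights


def stacked_attention(question, regions, num_layers=2):
    """Apply attention repeatedly; each layer refines the query by adding the
    attended image context to it (an assumed, simplified update rule)."""
    regions = [l2_normalize(r) for r in regions]
    query = list(question)
    all_weights = []
    for _ in range(num_layers):
        context, weights = attend(query, regions)
        query = [q + c for q, c in zip(query, context)]
        all_weights.append(weights)
    return query, all_weights


# Toy example: three 2-D "region features" and one 2-D "question encoding".
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [2.0, 0.0]
fused, weights_per_layer = stacked_attention(question, regions, num_layers=2)
```

In a trained model, the question vector would come from a recurrent encoder such as a GRU and the region features from a detector such as Mask R-CNN, with learned projections in place of the raw dot product.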
Keywords/Search Tags: Deep Learning, Visual Question Answering, Convolutional Neural Network, Recurrent Neural Network