
Research On Visual Question Answering System Based On Relational Reasoning Network

Posted on: 2021-01-20  Degree: Master  Type: Thesis
Country: China  Candidate: X F Ding  Full Text: PDF
GTID: 2518306050465554  Subject: Master of Engineering
Abstract/Summary:
With the rapid progress of deep learning and neural network technology, computer vision and natural language processing have achieved remarkable success in their respective fields. Convolutional neural networks and recurrent neural networks in particular have played an important role in tasks such as image recognition, object detection, image segmentation, speech recognition, and speech translation. Visual question answering combines computer vision and natural language processing technologies to realize intelligent question answering in complex scenes and to promote artificial intelligence applications such as human-computer interaction, and it has attracted increasing attention from researchers. Unlike traditional question answering systems, which receive only textual information and, lacking context, may predict incorrect answers, a visual question answering system is given scene-level visual information: in addition to understanding the natural language question, it combines the content provided by the image in order to predict an accurate answer. In recent years, driven by accurate and authoritative datasets collected from the real world, such as DAQUAR, VQA, and COCO-QA, a variety of visual question answering algorithms have emerged, including algorithms based on traditional machine learning, on database search, and on attention mechanisms. These algorithms have not fully met the accuracy and real-time requirements of visual question answering, which limits its further development. This thesis therefore proposes a visual question answering system that integrates visual and textual information: an attention mechanism is designed to fuse the correlated features of the image and the question, and a relational reasoning network is embedded to predict the answer, significantly improving question answering accuracy. The main work of this thesis is as follows:
1. A two-branch convolutional neural network architecture is proposed to extract image feature information. On the basis of a multi-level understanding of the image content, an attention fusion mechanism is used to extract effective image-question joint features, improving the expressive power of the multi-modal features and thus the answer prediction accuracy of the visual question answering system. First, for the input image, a ResNet model and a Faster R-CNN model are used as two feature extraction branches to obtain global and local image features. Then, the attention mechanism takes into account the relationship between the two visual feature extraction branches and produces multi-level image-question joint features that retain only the visual information most relevant to the given question. Finally, the attention maps generated by the two attention mechanisms are fused non-linearly to produce the joint image-question feature, as sketched below.
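The question-guided attention fusion described above can be illustrated with a small PyTorch-style sketch. This is a minimal illustration under assumed settings (a 1024-d question encoding, a 2048-d ResNet global feature, 36 Faster R-CNN region features, illustrative module and parameter names), not the thesis implementation.

```python
# Hypothetical sketch: question-guided attention over region features plus
# non-linear fusion with the global image feature. All dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    def __init__(self, q_dim=1024, g_dim=2048, r_dim=2048, joint_dim=1024):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, joint_dim)   # question encoding
        self.g_proj = nn.Linear(g_dim, joint_dim)   # global ResNet feature
        self.r_proj = nn.Linear(r_dim, joint_dim)   # per-region Faster R-CNN features
        self.att = nn.Linear(joint_dim, 1)          # scalar attention score per region

    def forward(self, q, g, regions):
        # q: (B, q_dim), g: (B, g_dim), regions: (B, N, r_dim)
        q_h = self.q_proj(q)                                    # (B, joint_dim)
        r_h = self.r_proj(regions)                              # (B, N, joint_dim)
        # question-guided attention over the N region features
        scores = self.att(torch.tanh(r_h + q_h.unsqueeze(1)))   # (B, N, 1)
        weights = F.softmax(scores, dim=1)
        attended = (weights * r_h).sum(dim=1)                   # (B, joint_dim)
        # non-linear fusion of global, attended-local, and question features
        joint = torch.tanh(self.g_proj(g)) * torch.tanh(attended) * torch.tanh(q_h)
        return joint

# Usage with random tensors standing in for already-extracted features
fusion = AttentionFusion()
q = torch.randn(4, 1024)             # question encoding (e.g. from an LSTM)
g = torch.randn(4, 2048)             # ResNet global image feature
regions = torch.randn(4, 36, 2048)   # Faster R-CNN region features
joint = fusion(q, g, regions)        # (4, 1024) joint image-question feature
```

The softmax over regions plays the role of keeping only the visual information most relevant to the question before the non-linear fusion step.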
2. The relational reasoning network is introduced into the visual question answering system to improve its ability to reason about relationships within the multi-modal joint features, thereby improving answer prediction accuracy. Traditional visual question answering systems fail to exploit the relational information among the components of the joint feature, so a large amount of redundant information is produced by traversing all feature combinations during answer prediction. The proposed system embeds the relational reasoning network, as a dedicated neural network module that computes relationships, into the visual question answering pipeline and uses its ability to reason about relationships between features to screen feature combinations, thereby improving accuracy; a minimal sketch of such a relation module is given after the experimental summary below.
3. The models proposed in this thesis are compared with existing models on the two authoritative datasets VQA and COCO-QA, with both multi-model comparison experiments and single-model ablation experiments. In the multi-model comparisons, on the standard test and validation splits of the VQA dataset the average prediction accuracy of the model improves by 1.5%~3.2%, and on the test split of the COCO-QA dataset the average prediction accuracy and the standard WUPS score improve by 1.3%~2.6% and 1.1%~2.4%, respectively. In the single-model ablation experiments, on the standard test split of the VQA dataset, the average prediction accuracy drops by 1.7%~4.2% when the proposed sub-modules are replaced. The experimental results show that the two-branch image extraction network and the attention fusion mechanism enable a deep understanding of the image content and generate efficient multi-modal joint features, which contributes to the improvement in accuracy, and that embedding the relational reasoning network effectively improves prediction accuracy without affecting the running speed of the model.
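The relational reasoning step in contribution 2 can be illustrated with a sketch in the generic relation-network style: a small MLP g scores every pair of features conditioned on the question, the pairwise outputs are summed, and an MLP f maps the aggregate to answer logits. The dimensions, the 3000-answer vocabulary, and all names below are illustrative assumptions rather than the exact module described in the thesis.

```python
# Hypothetical relation-reasoning module: pairwise MLP g, sum aggregation,
# answer MLP f. All sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    def __init__(self, obj_dim=1024, q_dim=1024, hidden=512, num_answers=3000):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.f = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, objects, q):
        # objects: (B, N, obj_dim) region/joint features, q: (B, q_dim) question encoding
        B, N, D = objects.shape
        o_i = objects.unsqueeze(2).expand(B, N, N, D)    # first member of each pair
        o_j = objects.unsqueeze(1).expand(B, N, N, D)    # second member of each pair
        q_rep = q.unsqueeze(1).unsqueeze(1).expand(B, N, N, q.size(-1))
        pairs = torch.cat([o_i, o_j, q_rep], dim=-1)     # (B, N, N, 2*D + q_dim)
        relations = self.g(pairs).sum(dim=(1, 2))        # aggregate all pairwise relations
        return self.f(relations)                         # (B, num_answers) answer logits

# Example: predict answer scores from 36 feature vectors and a question encoding
model = RelationModule()
logits = model(torch.randn(4, 36, 1024), torch.randn(4, 1024))
```

Conditioning every pair on the question lets the module weight feature combinations by their relevance to the question instead of treating all combinations equally, which is the screening effect described in contribution 2.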
Keywords/Search Tags: Visual question answering system, relational inference network, convolutional neural network, recurrent neural network, attention mechanism