Video Question Answering Based On Attention Mechanism And Graph Convolutional Network

Posted on:2022-05-17

Degree:Master

Type:Thesis

Country:China

Candidate:B L Zhang

Full Text:PDF

GTID:2518306317489584

Subject:Computer Science and Technology

Abstract/Summary:

With the development of computer vision and natural language processing, research topics that combine vision and language have attracted growing attention. Video question answering is a cross-modal task that involves several kinds of modal data, such as video, audio and text. Given a short video and a question expressed in natural language, a video question answering system must understand the semantics of both the video and the question in order to answer it. Current video question answering models can capture semantic information across modalities, but several shortcomings remain: (1) convolutional and recurrent neural networks are used to model the spatiotemporal information in a video, but the resulting temporal features cannot express the association between different video frames; (2) when a video contains multiple moving objects, current methods cannot extract the motion information of each object separately and are easily disturbed by coarse-grained information in the video; (3) current methods based on graph convolutional networks construct nodes from the visual information of objects only, so the nodes cannot express the position and motion of the objects, and the reasoning process lacks guidance.

To address these problems, this thesis proposes a Question-aware Motion Graph Convolutional Network (QMGCN), which consists of three core modules. (1) The sub-video alignment module obtains sub-videos that share the same spatial dimensions and contain only the object image, and divides each sub-video into multiple clips in order to extract the motion features of the object over different time periods. Features extracted in this way strengthen the attention paid to the motion of a single object and eliminate visual interference from unrelated background regions. (2) The object joint feature generation module adds the spatial information (position, size) and category information of the object to its motion feature. The resulting object joint features have stronger semantic representation ability without losing spatial information, and combine better with the question features. (3) The question-aware graph reasoning module feeds the object joint features and the question features generated by a BiGRU into an attention model to produce question-aware joint features. These serve as the input of a graph convolutional network that infers the complex relationships between different objects in the video, so that reasoning gradually focuses on the nodes related to the question and strengthens their connections.

Finally, the model is evaluated on the MSVD-QA, MSRVTT-QA, SVQA and TGIF-QA datasets and compared with existing methods. The results show that the proposed method is effective.
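
The data flow through modules (2) and (3) can be illustrated with a toy NumPy sketch. This is not the thesis implementation: the abstract does not specify the feature dimensions, the attention form or the way the graph is built, so the concatenation-based joint feature, the scaled dot-product attention, the similarity-derived adjacency matrix and all function names below are assumptions chosen for illustration, and random vectors stand in for real motion and BiGRU features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def joint_features(motion, boxes, categories, num_classes):
    """Concatenate motion, spatial (x, y, w, h) and one-hot category features."""
    one_hot = np.eye(num_classes)[categories]            # (N, num_classes)
    return np.concatenate([motion, boxes, one_hot], axis=1)


def question_aware(joint, question):
    """Scale each object's feature by its attention weight w.r.t. the question."""
    scores = joint @ question / np.sqrt(joint.shape[1])  # (N,)
    weights = softmax(scores)                            # sums to 1 over objects
    return weights[:, None] * joint, weights


def graph_conv(nodes, weight):
    """One graph-convolution step: ReLU(A @ X @ W), with A from node similarity."""
    adjacency = softmax(nodes @ nodes.T, axis=1)         # (N, N), rows sum to 1
    return np.maximum(adjacency @ nodes @ weight, 0.0)


# Toy inputs standing in for real extracted features (all values are random).
rng = np.random.default_rng(0)
num_objects, motion_dim, num_classes = 4, 6, 3
motion = rng.normal(size=(num_objects, motion_dim))
boxes = rng.uniform(size=(num_objects, 4))               # normalised x, y, w, h
categories = np.array([0, 2, 1, 2])

joint = joint_features(motion, boxes, categories, num_classes)
question = rng.normal(size=joint.shape[1])               # stand-in for a BiGRU output
layer_weight = rng.normal(size=(joint.shape[1], joint.shape[1])) * 0.1

attended, attention = question_aware(joint, question)
reasoned = graph_conv(attended, layer_weight)
print(joint.shape, attention.round(3), reasoned.shape)
```

In this sketch the question vector is assumed to already have the same dimension as the joint feature; a real model would typically learn projections for both, and would stack several graph-convolution layers with learned weights.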

Keywords/Search Tags:

deep learning, video question answering, attention mechanism, graph convolutional network

Related items

1	Research On Deep Learning Algorithm For Automatic Question Answering
2	Research On Situational Reasoning Question Answer Method Based On Deep Learning
3	A Research Of Video Question Answering Based On Deep Learning
4	Research And Implementation Of VQA Based On Priori Attention Mechanism
5	Visual Question Answering Of Sport Scenes Based On Graph Neural Networks
6	Video Question Answering Based On Deep Learning
7	Intelligent Question Answering Of Deep Recurrent Neural Network Based On Self-Attention Mechanism
8	Research On Visual Question Answering Based On Deep Learning
9	Deep Convolutional Network And Regional Attention Network For Visual Question Answering
10	Research On Collaborative Attention Model And Deep Correlated Networks For Visual Question Answer