Video Question Answering Based On Attention Mechanism And Graph Convolutional Network

Posted on: 2022-05-17
Degree: Master
Type: Thesis
Country: China
Candidate: B L Zhang
Full Text: PDF
GTID: 2518306317489584
Subject: Computer Science and Technology
Abstract/Summary:
With the development of computer vision and natural language processing, research topics that combine vision and language have attracted increasing attention. Video question answering is a cross-modal task involving several modalities, such as video, audio, and text. Given a short video and a question expressed in natural language, a video question answering system must understand the semantics of both the video and the question in order to answer the question.

Current video question answering models have addressed the problem of understanding semantic information across different modalities, but some shortcomings remain: (1) They use convolutional neural networks and recurrent neural networks to model the spatiotemporal information in a video, but the resulting temporal features cannot express the associations between different video frames. (2) When a video contains multiple moving objects, current methods cannot extract the motion information of each object specifically, and are easily disturbed by coarse-grained information in the video. (3) Current methods based on graph convolutional networks construct nodes from the visual information of objects, which cannot express the position and motion information of the objects, and they lack guidance in the reasoning process.

To solve these problems, this thesis proposes a Question-aware Motion Graph Convolutional Network (QMGCN), which consists of three core modules. (1) The sub-video alignment module obtains sub-videos that share the same spatial dimensions and contain only the object image, and then divides each sub-video into multiple clips to extract the motion features of the object over different time periods. Features extracted in this way strengthen attention to the motion of a single object and eliminate visual interference from unrelated background regions. (2) The object joint feature generation module adds the spatial (position, size) and category information of each object to its motion feature. The resulting object joint features have stronger semantic representation ability without losing spatial information, and can be combined more effectively with the question features. (3) The question-aware graph reasoning module feeds the object joint features and the question features generated by a BiGRU into an attention model to produce question-aware joint features. These serve as the input to a graph convolutional network that infers the complex relationships between different objects in the video, so that reasoning gradually focuses on the nodes related to the question and strengthens the connections between them.

Finally, the model is tested on the MSVD-QA, MSRVTT-QA, SVQA, and TGIF-QA datasets and compared with existing methods. The results show that the proposed method is effective.
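
The question-aware graph reasoning step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the thesis implementation. The additive attention form, the similarity-based adjacency matrix, the feature dimensions, and the names `question_aware_features` and `gcn_layer` are all assumptions made for illustration, and random vectors stand in for the object joint features and the BiGRU question feature.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def question_aware_features(obj_feats, q_feat, W_o, W_q, w_a):
    """Weight each object's joint feature by its relevance to the question.

    obj_feats: (N, d) object joint features (assumed: motion + spatial + category).
    q_feat:    (d_q,) question feature (assumed: a BiGRU output vector).
    Returns the re-weighted features (N, d) and the attention weights (N,).
    """
    # Assumed additive attention: score_i = w_a . tanh(W_o o_i + W_q q)
    hidden = np.tanh(obj_feats @ W_o + q_feat @ W_q)  # (N, h)
    weights = softmax(hidden @ w_a)                   # (N,), sums to 1
    return obj_feats * weights[:, None], weights


def gcn_layer(node_feats, W):
    """One graph-convolution step: ReLU(A X W).

    The adjacency A is an assumption: a row-normalised softmax over
    pairwise dot products of the node features.
    """
    adj = softmax(node_feats @ node_feats.T, axis=-1)  # (N, N), rows sum to 1
    return np.maximum(adj @ node_feats @ W, 0.0)


# Toy run with random placeholders for the learned parameters.
rng = np.random.default_rng(0)
N, d, d_q, h = 5, 8, 6, 4
objects = rng.normal(size=(N, d))
question = rng.normal(size=d_q)
qa_feats, attn = question_aware_features(
    objects, question,
    rng.normal(size=(d, h)), rng.normal(size=(d_q, h)), rng.normal(size=h),
)
nodes = gcn_layer(qa_feats, rng.normal(size=(d, d)))
print(attn.shape, nodes.shape)
```

In this sketch the attention weights scale each node before message passing, so objects judged irrelevant to the question contribute less to their neighbours. A trained model would learn `W_o`, `W_q`, `w_a`, and `W` rather than sample them.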
Keywords/Search Tags:deep learning, video question answering, attention mechanism, graph convolutional network