
Video Question Answering Based On Deep Memory Fusion Method

Posted on: 2022-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: M Wu
Full Text: PDF
GTID: 2518306314468814
Subject: Computer Science and Technology
Abstract/Summary:
With the deepening of artificial intelligence research, video question answering, a field at the intersection of computer vision and natural language processing, has attracted widespread attention from researchers. The main task of video question answering is to output the correct answer given a video clip and a question posed in natural language. With the popularity of short-video applications in recent years, research on video question answering will greatly promote the commercial application of multimedia information retrieval and artificial-intelligence assistants. Compared with images, video contains richer multimodal and reasoning information.

Current video question answering models based on recurrent neural networks have three main shortcomings. First, only key-frame features extracted from the video are used as the basis for answering, while the temporal information of the video itself is ignored. Second, the inference process lacks an effective information-storage component, which leads to the loss of important information associated with the question. Third, there is no efficient multi-modal fusion method, so it is difficult for video features and text features to interact effectively.

To address these problems, this thesis proposes the Deep Memory Fusion Model (DMFM), inspired by the memory networks used in text question answering research. The model consists of four parts: a memory storage module, a redundant-memory filtering module, a memory fusion module, and an answer generation module. In the memory storage module, the model first extracts video features with the convolutional neural network ResNet and the three-dimensional convolutional network C3D, encodes the subtitle text with GloVe word embeddings, and then applies an optimized Multimodal Compact Bilinear Pooling (MCB) method to fuse the video and subtitle features before storing them in the memory component of the memory network. Next, in the redundant-memory filtering module, a multi-modal similarity matching method scores the stored memories against the question features and selects the fused features relevant to the question. Then, in the memory fusion module, a preliminary fusion by a multi-layer convolutional network and a secondary fusion by an attention mechanism produce a multi-modal context representation of the entire video, from which the answer generation module finally generates the answer.

The proposed model is evaluated on the public MovieQA and TGIF-QA datasets. Its accuracy improves significantly over existing methods, and the model shows good generalization performance.
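The MCB fusion step named in the abstract refers to Multimodal Compact Bilinear Pooling, which approximates the outer product of two modality features by circularly convolving their Count Sketch projections in the frequency domain. The following is a minimal NumPy sketch of that general technique, not the thesis's exact implementation; the feature dimensions, output size d, and random seed are illustrative assumptions.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x into R^d with a Count Sketch defined by hash indices h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # scatter-add each signed entry to its hashed slot
    return y

def mcb(x, y, d=1024, seed=0):
    """Compact bilinear pooling of two feature vectors.

    Approximates the outer product x (x) y by convolving the Count Sketches
    of x and y, computed efficiently via FFT (convolution theorem).
    """
    rng = np.random.default_rng(seed)  # fixed seed so the sketch is reused across calls
    hx = rng.integers(0, d, size=x.shape[0])
    sx = rng.choice([-1.0, 1.0], size=x.shape[0])
    hy = rng.integers(0, d, size=y.shape[0])
    sy = rng.choice([-1.0, 1.0], size=y.shape[0])
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)  # element-wise product in frequency = circular conv

video_feat = np.random.randn(2048)  # e.g. a ResNet frame feature (assumed size)
text_feat = np.random.randn(300)    # e.g. a GloVe subtitle embedding (assumed size)
fused = mcb(video_feat, text_feat)  # compact bilinear fused feature to store in memory
```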
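The redundant-memory filtering module matches and scores stored memories against the question. The abstract does not specify the similarity function, so the sketch below assumes a scaled dot-product score followed by a softmax and top-k selection; the function name and k are hypothetical.

```python
import numpy as np

def filter_memory(memory, question, k=5):
    """Score each memory slot against the question and keep the k most relevant.

    memory:   (T, d) fused video/subtitle features, one slot per time step
    question: (d,)   encoded question feature
    """
    scores = memory @ question / np.sqrt(memory.shape[1])  # scaled dot-product similarity
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax relevance weights
    top = np.sort(np.argsort(weights)[-k:])                # top-k slots, in temporal order
    return memory[top], weights
```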
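Finally, the memory fusion module performs a preliminary convolutional fusion over the selected slots and a secondary, question-guided attention fusion into a single context representation. The PyTorch module below is a minimal sketch of that two-stage pattern: a single 1-D convolution stands in for the thesis's multi-layer convolutional network, and the additive attention form is an assumption.

```python
import torch
import torch.nn as nn

class MemoryFusion(nn.Module):
    """Two-stage fusion: 1-D convolution over memory slots, then
    question-guided attention pooling into one context vector."""

    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.att = nn.Linear(dim, 1)

    def forward(self, memory, question):
        # memory: (T, dim) filtered slots; question: (dim,) question feature
        h = self.conv(memory.t().unsqueeze(0)).squeeze(0).t()         # (T, dim) conv fusion
        a = torch.softmax(self.att(torch.tanh(h + question)), dim=0)  # (T, 1) attention
        return (a * h).sum(dim=0)  # multi-modal context representation, fed to the answerer
```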
Keywords/Search Tags: video question answering, video understanding, multi-modality fusion, memory network, attention mechanism