
Research And Application Of Visual Semantic Representation Model In Video Question Answering

Posted on: 2019-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: B Wang
Full Text: PDF
GTID: 2428330626952086
Subject: Computer Science and Technology
Abstract/Summary:
Bridging visual understanding and human-computer interaction is a challenging task in artificial intelligence. At present, deep learning is widely used in computer vision and natural language processing. Although video captioning has shown promise in connecting visual content to natural language, it usually narrates only the coarse semantics of the visual content and lacks the ability to model the different correlations among visual cues. Visual question answering, in contrast, relies on holistic scene understanding: the model must understand different levels of vision and text content, and even external knowledge, in order to find the correct answer. When the human brain handles cognitive tasks similar to video question answering, it not only processes the information currently received but also retrieves and reasons over knowledge stored in the brain according to that information. Memory and external knowledge therefore play an important role in cognition and understanding.

This paper proposes two approaches. First, the Layered Memory Network (LMN) enriches video features with semantic information through a hierarchical representation process. Second, a new dataset named PlotGraphs is introduced as external knowledge, and the proposed Plot Graph Representation Network (PGRN) handles the video question answering task together with the LMN.

Specifically, the LMN first extracts words and sentences from the training video subtitles. The hierarchically formed movie representations learned by the LMN encode not only the correspondence between words and the visual content inside frames but also the temporal alignment between sentences and frames inside video clips. The PlotGraphs dataset contains a large amount of movie information organized in a graph structure. The Plot Graph Representation Network (PGRN) proposed in this paper represents the semantic and relational information in the graph and can be combined with the LMN into a new model that improves the understanding of visual content.

We conduct extensive experiments on the MovieQA and PlotGraphs datasets. With only video content as input, the LMN with frame-level representation obtains a large performance improvement. When subtitles are incorporated into the LMN to form the clip-level representation, we achieve state-of-the-art performance on the online evaluation task of 'Video+Subtitles'. After integrating external knowledge, the performance of the model consisting of the LMN and the PGRN improves further.
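To make the hierarchical representation idea concrete, the following is a minimal PyTorch sketch of a two-level attended memory in the spirit of the LMN: frame features attend over a word-level memory, and the pooled frame representations then attend over a sentence-level memory. The module names, dimensions, mean pooling, and dot-product attention are illustrative assumptions, not the thesis's exact formulation.

```python
# Illustrative sketch only: a word-level then sentence-level attended memory,
# loosely following the frame-level / clip-level layering described above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttendedMemory(nn.Module):
    """Attend a bank of text memories with a visual query (assumed scheme)."""

    def __init__(self, vis_dim: int, txt_dim: int, hid_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(vis_dim, hid_dim)  # project visual query
        self.k_proj = nn.Linear(txt_dim, hid_dim)  # project text memory keys
        self.v_proj = nn.Linear(txt_dim, hid_dim)  # project text memory values

    def forward(self, visual: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # visual: (B, vis_dim); memory: (B, N, txt_dim)
        q = self.q_proj(visual).unsqueeze(1)        # (B, 1, H)
        k = self.k_proj(memory)                     # (B, N, H)
        v = self.v_proj(memory)                     # (B, N, H)
        attn = F.softmax((q * k).sum(-1), dim=-1)   # (B, N) relevance weights
        fused = (attn.unsqueeze(-1) * v).sum(1)     # (B, H) attended text memory
        return fused + q.squeeze(1)                 # fuse visual and text cues


class LayeredMemorySketch(nn.Module):
    """Frame-level (word memory) then clip-level (sentence memory) representation."""

    def __init__(self, vis_dim: int = 2048, txt_dim: int = 300, hid_dim: int = 512):
        super().__init__()
        self.word_layer = AttendedMemory(vis_dim, txt_dim, hid_dim)
        self.sent_layer = AttendedMemory(hid_dim, txt_dim, hid_dim)

    def forward(self, frames, word_mem, sent_mem):
        # frames: (B, T, vis_dim); word_mem: (B, Nw, txt_dim); sent_mem: (B, Ns, txt_dim)
        frame_repr = torch.stack(
            [self.word_layer(frames[:, t], word_mem) for t in range(frames.size(1))],
            dim=1,
        )                                            # (B, T, H) frame-level representation
        clip_query = frame_repr.mean(dim=1)          # pool frames into a clip-level query
        return self.sent_layer(clip_query, sent_mem) # (B, H) clip-level representation


if __name__ == "__main__":
    model = LayeredMemorySketch()
    out = model(torch.randn(2, 8, 2048), torch.randn(2, 20, 300), torch.randn(2, 5, 300))
    print(out.shape)  # torch.Size([2, 512])
```

In an actual question-answering pipeline, the resulting clip-level representation would be matched against question and candidate-answer embeddings; that scoring step, like the external-knowledge fusion with the PGRN, is omitted from this sketch.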
Keywords/Search Tags:Deep Learning, Visual Question Answering, Layered Memory Network, Plot Graph Representation Network, External Knowledge