
Video-Natural Language Temporal Localization Based On Deep Learning

Posted on: 2021-05-25    Degree: Master    Type: Thesis
Country: China    Candidate: X Yang    Full Text: PDF
GTID: 2428330611498163    Subject: Computer technology
Abstract/Summary:
With advances in communication and microelectronics technology and the rapid development of the Internet, smartphones have become available to almost everyone. Pictures, videos, and speech have replaced text as the most mainstream and important information media in people's lives. For people, however, natural language remains the most convenient and intuitive way to retrieve information. Retrieving video and image content with natural language queries is collectively referred to as the cross-modal retrieval problem. Video-natural language temporal localization is one subset of cross-modal retrieval: given a natural language description and a video, the task is to find the clip in the video that matches the description. The purpose of this thesis is to solve the video-natural language localization problem with deep learning methods and to improve on existing models.

In research on temporal localization in video, researchers often focus on improving the IoU of the result while sacrificing the accuracy of boundary localization. To address this problem, this thesis introduces fine-grained interaction between text and video information, so that each time step in the video can perceive the textual information more effectively and thus judge the boundaries that correspond to the text. At the same time, this thesis proposes a boundary-aware method that learns boundary precision separately, improving the accuracy of boundary localization.

Fine-grained interaction between word-level text and video information loses some of the internal organization and local information of the language description, which often corresponds to a semantic entity in the video. To address this, this thesis introduces machine reasoning based on combined attention: by repeatedly extracting the semantic entities of the language description over multiple stages, the description is divided into several semantic groups that are relatively independent of one another. Cross-modal information interaction is then performed between the language semantic groups and the video, so that the internal organization of the language is taken into account.

For the problem of semantic alignment in video localization, temporal adverbials such as "the second time" and "after" affect the localization of textual information. This thesis therefore uses context fusion based on a self-attention mechanism to collect effective time-dependent information in the video. Because self-attention captures dependencies across the whole video and does not consider its temporal order, local context modeling is introduced before global context modeling to strengthen the relevance and organization of local video information.

Experiments on the proposed schemes are carried out on the ActivityNet Captions, TACoS, and Charades-STA datasets. Compared with other existing methods, better results are achieved, and analysis of the model experiments confirms the effectiveness of the design. Finally, the algorithm is used to implement a video clipping tool based on language localization.
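The thesis's actual network is not specified here, but the fine-grained interaction idea can be illustrated with a minimal NumPy sketch: each video time step attends over every word of the query and mixes the attended text vector back into its visual feature. The function name `fine_grained_interaction` and all array shapes are assumptions for illustration, not the thesis's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fine_grained_interaction(video, words):
    """video: (T, d) frame-level features; words: (L, d) word embeddings.
    Each time step gets its own attention distribution over the words,
    so textual cues are perceived per step rather than once per sentence."""
    scores = video @ words.T            # (T, L): similarity of each step to each word
    attn = softmax(scores, axis=-1)     # per-step distribution over words
    attended_text = attn @ words        # (T, d): the text summary seen by each step
    return np.concatenate([video, attended_text], axis=-1)  # (T, 2d) fused features

rng = np.random.default_rng(0)
fused = fine_grained_interaction(rng.normal(size=(8, 16)), rng.normal(size=(5, 16)))
print(fused.shape)  # (8, 32)
```

A boundary-aware head, as described above, would then score each fused time step separately for "start" and "end" rather than only scoring whole candidate segments.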
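The local-before-global context modeling described above can also be sketched in NumPy: a small neighborhood average stands in for a learned 1-D convolution (local order-sensitive context), followed by plain dot-product self-attention over the whole clip (global dependencies). Function names, the neighborhood size `k`, and the use of an unlearned average are illustrative assumptions, not the thesis's design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_context(video, k=3):
    """Average each step with its k-neighborhood (a stand-in for a 1-D conv),
    injecting temporal order before the order-agnostic self-attention."""
    T, _ = video.shape
    pad = k // 2
    padded = np.pad(video, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([padded[t:t + k].mean(axis=0) for t in range(T)])

def global_context(video):
    """Plain scaled dot-product self-attention over the whole clip."""
    attn = softmax(video @ video.T / np.sqrt(video.shape[1]), axis=-1)
    return attn @ video

def contextualize(video):
    return global_context(local_context(video))

rng = np.random.default_rng(1)
out = contextualize(rng.normal(size=(10, 8)))
print(out.shape)  # (10, 8)
```

The ordering matters: self-attention alone treats the clip as a set, so cues like "the second time" or "after" are only resolvable once local modeling has made neighboring steps distinguishable.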
Keywords/Search Tags: Cross-modal, feature fusion, self-attention, fine-grained