
Leveraging Semantic and Visual Structures for Video Temporal Grounding

Posted on: 2022-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Wu
Full Text: PDF
GTID: 2518306557977249
Subject: Software engineering
Abstract/Summary:
With the popularization of portable digital devices and the development of the mobile Internet, video big data is growing explosively. Effective cross-modal video retrieval is therefore urgently needed to meet users' ever-increasing information interests. However, since interests differ across users, a given user may care about only some segments of a video. In this setting, accurately localizing the video content that is semantically relevant to a natural language query has become an important technology for many aspects of daily life, and a research hotspot in the field of multimedia retrieval.

The goal of video temporal grounding (VTG) is to learn the semantic relevance between a given query sentence and a video segment. Existing methods can generally be divided into two categories: (1) two-stage methods, which first generate candidate video clips of different lengths in a sliding-window manner and then rank the candidates by their similarity to the query sentence; and (2) one-stage methods, which directly predict the time interval of the target segment to obtain the localization result. However, most of these methods use only the global features of the query sentence for cross-modal matching and seldom explore fine-grained semantic structures for video temporal grounding. In fact, because of the large semantic gap between low-level visual information and high-level semantic information, it is important to explore semantic-structure-driven approaches to video temporal grounding. This thesis focuses on designing effective semantic structure extraction methods for this task. Starting from the cross-modal association between natural language query sentences and videos, two methods are designed to improve the accuracy of video temporal grounding. The main contributions and innovations are as follows:

(1) This thesis proposes a unified framework for video temporal grounding that simultaneously encodes semantic and visual structures. Specifically, a semantic role tree is built to reveal fine-grained semantic information by generating hierarchical textual embeddings. The semantic structure is then used to facilitate visual structure learning through a contextual attention-based proposal interaction module. Finally, visual-semantic matching information is adaptively aggregated through a multi-level fusion strategy to select the best-matching moment proposal.

(2) Furthermore, this thesis proposes a video temporal grounding method that jointly considers explicit and implicit semantic structures. To depict the explicit semantic structure, the semantic role tree introduced above is adopted to model the content of the query sentence explicitly. To portray the implicit semantic structure and adaptively, dynamically learn implicit and complementary semantic information from the sentence, a cascaded implicit semantic information extraction module built on a dynamic attention mechanism is designed. The explicit and implicit semantic features share the same video visual feature representation, and the final cross-modal video feature is obtained through an isomorphic cross-modal fusion module. Optimized with joint classification and regression loss functions, the whole model can be trained in an end-to-end manner.

(3) Extensive experiments on two popular benchmarks (Charades-STA and ActivityNet Captions) demonstrate that the proposed approaches achieve significant improvements over the current state of the art.
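The two-stage paradigm described above can be illustrated with a minimal sketch: enumerate sliding-window moment candidates over a video, then rank them by similarity to the query. This is a simplifying illustration, not the thesis's implementation — the mean-pooled clip features, cosine-similarity scoring, and the specific window sizes are all assumptions; real systems use learned cross-modal encoders.

```python
# Sketch of the two-stage VTG paradigm: sliding-window proposal
# generation followed by similarity-based ranking. Features are
# plain lists of floats standing in for learned embeddings.
import math

def sliding_window_proposals(video_len, window_sizes, stride):
    """Enumerate [start, end) candidate moments over a video of
    `video_len` units (e.g. clip indices)."""
    proposals = []
    for w in window_sizes:
        start = 0
        while start + w <= video_len:
            proposals.append((start, start + w))
            start += stride
    return proposals

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_proposals(proposals, clip_feats, query_feat):
    """Score each proposal by the cosine similarity between its
    mean-pooled clip features and the query feature; return the
    candidates sorted from best to worst match."""
    scored = []
    for s, e in proposals:
        segment = clip_feats[s:e]
        dim = len(query_feat)
        pooled = [sum(f[d] for f in segment) / len(segment) for d in range(dim)]
        scored.append(((s, e), cosine(pooled, query_feat)))
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy example: 10 clips, where clips 2..5 match the query direction.
clip_feats = [[0.0, 1.0]] * 2 + [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
ranked = rank_proposals(sliding_window_proposals(10, [4], 2),
                        clip_feats, [1.0, 0.0])
best_moment = ranked[0][0]  # the top-scoring (start, end) interval
```

A one-stage method would instead skip proposal enumeration and directly regress the (start, end) interval from the fused video-query features.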
Keywords/Search Tags:Video Temporal Localization, Semantic Structure, Semantic Role Tree, Visual Structure Modeling, Implicit Semantic Structure