
Content-Based Video Clip Retrieval

Posted on: 2021-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: C J Lu
Full Text: PDF
GTID: 2428330623969217
Subject: Computer Science and Technology
Abstract/Summary:
Nowadays, the popularity of cameras and social networks has accelerated the generation and spread of visual media content, especially videos. Cross-modal video retrieval techniques have emerged accordingly, and the area is very active both in research and in commercial applications. In this thesis, we focus on the task of query-based video localization: localizing (i.e., grounding) a query in a long, untrimmed video sequence. All currently published models for this problem fall into two types: (i) the top-down approach, which performs classification and regression over a set of pre-cut video segment candidates; and (ii) the bottom-up approach, which directly predicts, for each video frame, the probability that it is a temporal boundary (i.e., a start or end time point). However, both approaches suffer from limitations: the former is computation-intensive because of its densely placed candidates, while the latter has so far trailed the performance of its top-down counterpart. This thesis analyzes the design flaws of existing models and improves upon the deficiencies of existing bottom-up models. The main research contents of this thesis are summarized as follows:

1) We propose a novel dense bottom-up framework: DEnse Bottom-Up Grounding (DEBUG). DEBUG regards all frames falling within the ground-truth segment as foreground, and each foreground frame regresses its unique distances to the two ground-truth boundaries. In addition, we propose a temporal pooling trick that considers multiple frame predictions to avoid unstable performance.

2) We design a novel bottom-up model: Graph-FPN with Dense Predictions (GDP). GDP first generates a frame feature pyramid to capture multi-level semantics, then applies graph convolution to encode rich scene relationships, which also mitigates the semantic gaps across levels of the multi-scale feature pyramid.

Extensive experiments on four challenging benchmarks (TACoS, Charades-STA, ActivityNet Captions and Activity-VRL) show that our models match the speed of bottom-up models while surpassing the performance of state-of-the-art top-down models.
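To make the dense bottom-up idea concrete, the sketch below decodes a segment from hypothetical per-frame outputs: a foreground confidence score per frame and the two regressed distances from that frame to the segment's start and end. The score-weighted top-k averaging stands in for the temporal pooling trick mentioned above; the function name, the `top_k` parameter, and all input shapes are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def decode_segment(scores, dists, top_k=5):
    """Decode one temporal segment from dense per-frame predictions.

    scores: (T,) foreground confidence for each frame
    dists:  (T, 2) regressed distances from frame t to the segment
            start (dists[t, 0]) and end (dists[t, 1])
    """
    T = len(scores)
    t = np.arange(T)
    starts = t - dists[:, 0]   # each frame's predicted start time
    ends = t + dists[:, 1]     # each frame's predicted end time
    # Temporal pooling: average the boundary predictions of the
    # top-k most confident frames instead of trusting a single frame.
    top = np.argsort(scores)[-top_k:]
    w = scores[top] / scores[top].sum()
    return float(starts[top] @ w), float(ends[top] @ w)
```

If the confident frames agree on the boundaries, the pooled prediction recovers them exactly; when individual frames disagree, the weighted average damps per-frame noise, which is the instability the pooling trick targets.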
Keywords/Search Tags:Deep Learning, Video Localization, Computer Vision