
Study On Natural Language Video Moment Retrieval Methods

Posted on: 2023-03-27  Degree: Master  Type: Thesis
Country: China  Candidate: J Y Teng  Full Text: PDF
GTID: 2568306617952879  Subject: Software engineering
Abstract/Summary:
As we enter the age of information explosion, with all kinds of information competing for our attention, most people rely increasingly on visual media. With the rise of short videos, videos on online platforms are also surrounded by multiple forms of natural language text, e.g., video comments, authors' video descriptions, and editorial recommendations. Unlike images, videos require people to aggregate information along the temporal dimension and to view a video repeatedly to pin down the target clip. To address this need, the task of natural language video moment retrieval has been proposed. It is an emerging multimodal video retrieval task that aims to locate the start and end times of the clip corresponding to a query sentence, and it is widely applicable in video surveillance, security systems, and film and television production. According to the nature of label supervision, existing approaches fall into two categories: strongly supervised learning, where accurate moment labels are provided, and weakly supervised learning, where only video-sentence-level labels are available. To explore the task more comprehensively, this thesis presents research on both learning approaches.

Since weakly supervised learning does not provide accurate moment labels, this thesis designs a dual granularity loss function that considers both video-level and clip-level relationships, guiding the model to capture the video clips that best match the text description. Specifically, we first generate coarse video clips and treat each clip as an instance. A video-level regularized multiple instance loss exploits the potential alignment between all intra-video clips and the text description. We then view clip classification as a supervised learning task under noisy labels: with an instance-level regularized loss function, the model learns to correct noisy instance-level labels and thus find more accurate frame boundaries among the positive instances. Comprehensive experiments on the ActivityNet and DiDeMo datasets show that the proposed loss function achieves a new state of the art.

For the strongly supervised task, previous methods commonly fuse the video and query-sentence modalities without paying sufficient attention to global and local relationships; moreover, most of them rely on sliding windows to pre-segment fixed-length video clips, which loses video information. This thesis designs a dual-stream Transformer-based natural language video localization method that exploits the Transformer's capacity for global and local interaction to mine video information. The Transformer-based encoder combines cross-modal attention with self-attention: at the encoder level, video features are tuned by the query vector so that visually relevant frames receive more attention, and, because the dual-stream model runs in parallel, the same layer simultaneously guides the video features to highlight the decisive words in the query sentence. For the final localization step, a boundary regression approach is adopted to predict temporal locations, which not only improves accuracy and avoids redundant operations, but also allows localization at arbitrary video lengths and maintains temporal consistency. Comprehensive experiments on two open-source datasets, ActivityNet and Charades-STA, demonstrate the effectiveness of the proposed method.
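The dual granularity idea described above can be sketched as follows. This is a minimal illustration, not the thesis's exact formulation: the log-sum-exp aggregation, the sigmoid thresholding, and all function names are assumptions standing in for the video-level regularized multiple instance loss and the instance-level noisy-label correction.

```python
import math

def video_level_mil_loss(scores, pos_video=0):
    # scores[v][c]: similarity between clip c of video v and the query text.
    # Aggregate each video's clips with log-sum-exp (a smooth max), then
    # contrast the positive video against the others via softmax cross-entropy.
    video_scores = [math.log(sum(math.exp(s) for s in clips)) for clips in scores]
    z = sum(math.exp(v) for v in video_scores)
    return -math.log(math.exp(video_scores[pos_video]) / z)

def instance_level_loss(clip_scores, threshold=0.5):
    # Treat clips whose sigmoid probability exceeds the threshold as (noisy)
    # positive instances and the rest as negatives -- a simple pseudo-label
    # stand-in for the thesis's instance-level label-correction scheme.
    eps = 1e-8
    loss = 0.0
    for s in clip_scores:
        p = 1.0 / (1.0 + math.exp(-s))
        y = 1.0 if p > threshold else 0.0
        loss += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return loss / len(clip_scores)

# Toy example: 3 candidate videos x 4 clips; video 0 matches the query.
scores = [[2.0, 0.5, 1.5, 0.1],
          [0.2, 0.1, 0.3, 0.0],
          [0.1, 0.4, 0.2, 0.3]]
total = video_level_mil_loss(scores) + instance_level_loss(scores[0])
```

Combining the two granularities rewards the video that matches the query as a whole while sharpening which of its clips count as true positives.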
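The cross-modal attention in the dual-stream encoder can likewise be sketched in miniature. This single-head, dependency-free version is an assumption about the mechanism's general shape; the thesis's encoder additionally stacks self-attention, feed-forward layers, and normalisation, which are omitted here.

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def matmul(a, b):
    # a: (n, k), b: (k, m) -> (n, m), plain nested-list matrix product
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def cross_attention(queries, keys_values):
    # Scaled dot-product attention: each query row becomes a weighted mixture
    # of the key/value rows. With frames as queries and words as keys/values,
    # query-relevant frames are emphasised; swapping the roles lets the query
    # stream highlight decisive words, and the two calls can run in parallel.
    d = len(queries[0])
    logits = matmul(queries, transpose(keys_values))
    attn = [softmax([x / math.sqrt(d) for x in row]) for row in logits]
    return matmul(attn, keys_values)

# Toy features: 3 video frames and 2 query words in a 4-dim embedding space.
frames = [[0.1, 0.9, 0.0, 0.2], [0.8, 0.1, 0.3, 0.0], [0.0, 0.2, 0.7, 0.5]]
words  = [[1.0, 0.0, 0.0, 0.1], [0.0, 1.0, 0.2, 0.0]]
frames_out = cross_attention(frames, words)   # video stream attends to words
words_out  = cross_attention(words, frames)   # query stream attends to frames
```

Because each stream conditions on the other, the symmetry of the two calls mirrors the parallel dual-stream design described in the abstract.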
Keywords/Search Tags:Natural language video moment retrieval, video understanding, weakly supervised learning, strongly supervised learning, Transformer