Research On Natural Language-based Video Moment Retrieval

Posted on: 2023-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: X F Liu
GTID: 2568306614484574
Subject: Software engineering
Abstract/Summary:
With the development of hardware and software, a large amount of video data is generated every day by perception systems and networks built on video monitoring. Video content understanding and analysis has important application value in people's livelihood, security, transportation, entertainment and other fields, and natural language-based video moment retrieval is one of its important research topics. Natural language-based video moment retrieval refers to finding the segment of a video that matches a query sentence. The task involves both text understanding and visual understanding, and it must deliver high retrieval accuracy while remaining efficient and adaptable to different scenarios, which makes it difficult and challenging. Although the video understanding community has studied the task extensively in recent years, many problems remain open, both in the interaction between the text and visual modalities and in the complexity, accuracy and interpretability of locating the time boundaries of an event.

Focusing on natural language-based video moment retrieval, this thesis conducts in-depth research on cross-modal feature interaction and time boundary regression, and applies the resulting methods to the specific application scenario of vehicle video retrieval, where they achieve favorable results. The main work of this thesis is summarized as follows:

(1) A single-shot semantic matching network for moment retrieval is proposed. Matching natural language statements against proposals drawn from the original video leads to high complexity. To solve this problem, a lightweight network is proposed that avoids the complicated calculation of traditional methods and their limit on the duration of the moment. It samples the video evenly and predicts the matching score between each frame and the query, and the time boundary is then predicted from these scores. The model has high precision, a simple structure and high retrieval efficiency.

(2) An explicit correlation-based convolution boundary locator for moment retrieval is presented. In natural language-based video moment retrieval, localization based on boundary probability is a research hotspot, and it is applicable to videos of any length. However, traditional boundary localization methods suffer from problems such as poor convergence and insufficient interpretability. To solve these problems, a sliding convolution locator is proposed, which predicts the boundary probability by sliding a convolution kernel over the explicit matching scores. It has better accuracy, convergence and interpretability than traditional feature-based boundary localization methods.

(3) A dual-path temporal matching network for natural language-based vehicle retrieval is proposed. To meet the application demand of querying vehicles in large-scale road monitoring video with natural language, this method uses recurrent neural networks to model the video information and the language information, and improves the use of training samples to suit cross-modal retrieval. The dual-path temporal matching network adapts easily to video corpus retrieval, and it won second place in the AI City Challenge at CVPR 2021.
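
The two localization ideas in contributions (1) and (2), scoring each sampled frame against the query and then sliding a kernel over the score sequence to find boundaries, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the use of cosine similarity as the matching score, and the fixed hand-set kernels (standing in for learned convolution kernels) are all assumptions made for illustration.

```python
import numpy as np

def matching_scores(frame_feats, query_feat):
    """Cosine similarity between each sampled frame feature and the query feature.

    Assumption: frames and query already live in a shared embedding space.
    """
    f_norm = np.maximum(np.linalg.norm(frame_feats, axis=1, keepdims=True), 1e-8)
    q_norm = max(np.linalg.norm(query_feat), 1e-8)
    return (frame_feats / f_norm) @ (query_feat / q_norm)

def locate_moment(scores):
    """Slide two 1-D kernels over the score sequence to pick a start and an end frame.

    Assumption: fixed difference kernels replace the learned kernels of a real model.
    """
    scores = np.asarray(scores, dtype=float)
    # Start response at frame k: scores[k] - scores[k-1] (high where the score rises).
    left_padded = np.concatenate([scores[:1], scores])
    start_resp = np.correlate(left_padded, np.array([-1.0, 1.0]), mode="valid")
    # End response at frame k: scores[k] - scores[k+1] (high where the score falls).
    right_padded = np.concatenate([scores, scores[-1:]])
    end_resp = np.correlate(right_padded, np.array([1.0, -1.0]), mode="valid")
    start = int(np.argmax(start_resp))
    # Only consider end frames at or after the chosen start frame.
    end = start + int(np.argmax(end_resp[start:]))
    return start, end

# Toy example: 8 sampled frames, frames 3..5 point in the query's direction.
query = np.array([1.0, 0.0])
frames = np.array([[0.0, 1.0]] * 3 + [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 2)
scores = matching_scores(frames, query)
start, end = locate_moment(scores)
```

Because the boundaries are read directly from the explicit score sequence, one can inspect `scores` to see why a particular start and end frame were chosen, which is the kind of interpretability the abstract attributes to locating boundaries on matching scores rather than on fused features.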
Keywords/Search Tags:video content understanding and analysis, video moment localization, single-shot semantic matching network, convolution boundary locator, dual-path temporal matching network