Research On Temporal Action Localization And Sentence Query Localization In Videos

Posted on: 2022-12-23
Degree: Master
Type: Thesis
Country: China
Candidate: J H Yu
GTID: 2518306752454344
Subject: Master of Engineering
Abstract/Summary:
With the massive amount of video content generated every day, manual analysis can no longer keep up, so automated video analysis algorithms have become very important. In the field of video analysis, identifying the action category of a trimmed video clip is an important task. However, most videos are long and untrimmed, and contain both actions and background. This thesis focuses on temporal action localization in such videos. In addition, since traditional action localization methods cannot identify complex action combinations, details, or human-environment interaction in videos, this thesis also studies video localization by sentence query, i.e., locating the action that matches a sentence description in a given video. This task is also more general, because a query sentence can specify a variety of details such as scenes, subjects and objects, and other attributes of the actions in the video. For these two video localization tasks, the following studies are conducted.

(1) For the temporal action localization task, a self-attention assisted ranking network is proposed. It addresses the problem that existing boundary detection methods do not compute the confidence scores of generated proposals accurately enough during action localization. Given that the characteristics of videos are well suited to self-supervised learning, the algorithm combines a discriminative constraint and a generative constraint to train the self-attention, and computes a confidence score for each action proposal from the resulting attention weights to assist proposal ranking and retrieval. Experiments on the THUMOS14 dataset show a large improvement in average recall when only a small number of proposals is used.

(2) For the video localization by sentence query task, a network combining global and local information is proposed. It extracts global features of the video using a dense feature map and outputs accurate frame-level temporal localization results, addressing the problem that traditional boundary detection methods rely on local information only and have a small receptive field. In addition, the model takes a probabilistic approach and uses soft labels, because action boundaries are not clear-cut in this task, which makes it difficult to train boundary detection methods directly and to output accurate temporal boundaries for query sentences. Experiments on the Charades-STA dataset confirm the effectiveness of the method.

Based on the above work, this thesis designs and develops a prototype video action localization system, which allows users to select the appropriate localization method for different scenarios and accurately locate actions in a video, meeting the demand for action localization in practical applications.
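The two evaluation and training ideas mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the single tIoU threshold of 0.5, and the Gaussian shape of the soft labels are all assumptions made here for illustration. The sketch shows how recall with a small proposal budget depends on proposal ranking, and how a hard boundary frame can be replaced by a probability distribution over neighboring frames.

```python
import math
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end), e.g. in seconds


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(proposals: List[Segment], scores: List[float],
                ground_truth: List[Segment], n: int,
                tiou_threshold: float = 0.5) -> float:
    """Fraction of ground-truth actions matched by the top-n proposals.

    Proposals are ranked by confidence score, so a better scoring
    function raises recall when n is small. The threshold value is an
    assumption, not taken from the thesis.
    """
    order = sorted(range(len(proposals)),
                   key=lambda i: scores[i], reverse=True)[:n]
    kept = [proposals[i] for i in order]
    hits = sum(
        1 for gt in ground_truth
        if any(temporal_iou(gt, p) >= tiou_threshold for p in kept)
    )
    return hits / len(ground_truth) if ground_truth else 0.0


def soft_boundary_labels(num_frames: int, boundary_frame: int,
                         sigma: float = 1.0) -> List[float]:
    """Hypothetical soft label: a normalized Gaussian over frame indices
    centered on the annotated boundary, instead of a one-hot target."""
    weights = [
        math.exp(-((t - boundary_frame) ** 2) / (2 * sigma ** 2))
        for t in range(num_frames)
    ]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    proposals = [(0.0, 10.0), (20.0, 30.0), (4.0, 14.0)]
    scores = [0.2, 0.9, 0.8]  # the best proposal is ranked last
    ground_truth = [(0.0, 10.0)]
    print(recall_at_n(proposals, scores, ground_truth, n=2))
    print(recall_at_n(proposals, scores, ground_truth, n=3))
    print(soft_boundary_labels(num_frames=7, boundary_frame=3))
```

In the toy data, the proposal that matches the ground truth receives the lowest score, so it is only retrieved once the budget grows to three proposals. This is the kind of ranking error that a more accurate confidence score is meant to reduce.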
Keywords/Search Tags: deep learning, computer vision, video understanding, action detection, action localization