
Weakly Supervised Temporal Action Detection In Untrimmed Video

Posted on: 2022-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: H He
Full Text: PDF
GTID: 2518306524993489
Subject: Computer technology
Abstract/Summary:
In recent years, thanks to advances in artificial intelligence and innovations in deep learning and neural networks, the field of computer vision has made great strides. A large amount of data, in computer vision and in society at large, is stored as video, so intelligent video analysis and processing has become a popular research direction. Within intelligent video analysis, temporal action detection is a critical task: given a long, untrimmed video of human activities, the model must detect the start and end times of each human action segment and determine the category of each action. It is also a very difficult task. For fully supervised training, annotators must mark by hand both the category and the start and end times of every action segment in the long video; for weakly supervised training, they mark only the action categories. The former consumes a great deal of manpower and material resources, whereas the latter needs only category labels and a small amount of computing power. Weakly supervised temporal action detection in untrimmed videos is therefore particularly important for the optimization and integration of social resources, and it has attracted the attention of many researchers.

In my opinion, this task has three important problems: 1) Although it is a task of detecting human action segments, attributes of the person themselves, such as skeleton keypoints, are not used. 2) By calculating the cosine distance between the segment features of the frames of a video, we found that the discriminative power of commonly used features is relatively weak, which fundamentally limits the model. 3) Action segments in a video suffer considerable interference from background information, so it is very important to eliminate unnecessary background information.

This thesis therefore proposes a complete deep learning framework to solve the above problems: 1) A temporal attention mechanism based on human pose is designed. For each frame of the long video, it extracts the keypoints of the human skeleton and uses changes in human pose to help locate the start and end times of action segments in the fixed-length video. 2) An action capture branch is designed. It extracts key human action semantic features and key human pose features, fuses them by weighted feature fusion, and obtains a long-video feature representation with strong discriminative power. 3) A behavior-background separation module is designed. This module introduces a new loss function that learns the number of action instances contained in each action category of a long video, thereby enhancing the feature representation of temporally adjacent action instances in the video and effectively eliminating background information.
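Two operations mentioned in the abstract, the cosine distance between segment features and weighted feature fusion, can be illustrated with a minimal sketch. The abstract does not give the exact formulation used in the thesis, so everything below is an assumption for illustration only: the function names `cosine_distance` and `fuse_features`, the scalar weight `alpha`, and the use of a simple convex combination for the fusion are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def fuse_features(action_feat, pose_feat, alpha=0.5):
    """Fuse two equal-length feature vectors as a convex combination.

    alpha is a hypothetical scalar weight in [0, 1]; the thesis may
    instead learn the weights or use a different fusion scheme.
    """
    if len(action_feat) != len(pose_feat):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * p
            for a, p in zip(action_feat, pose_feat)]


# Toy example: two segment features that point in nearly the same
# direction have a small cosine distance, i.e. they are hard to separate.
segment_a = [1.0, 2.0, 3.0]
segment_b = [1.1, 2.0, 2.9]
distance = cosine_distance(segment_a, segment_b)
fused = fuse_features(segment_a, segment_b, alpha=0.7)
```

Under this reading, a small cosine distance between an action segment and a background segment would indicate that the features separate them poorly, which is the weakness described in problem 2.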
Keywords/Search Tags: temporal action detection, weak supervision, human pose attention, long video feature representation, background separation