Relation Aware Network For Weakly-Supervised Temporal Action Localization

Posted on:2022-05-22

Degree:Master

Type:Thesis

Country:China

Candidate:Y K Zhan

Full Text:PDF

GTID:2518306323478344

Subject:Cyberspace security

Abstract/Summary:

In recent years, temporal action localization in video has become an important and challenging research direction and has attracted growing attention because of its wide range of practical applications. However, fully supervised action localization requires substantial manual effort to obtain fine-grained frame-level or segment-level annotations on long untrimmed videos, which severely limits its practical use. In addition, with the boom in short videos on the Internet, users upload large amounts of video data accompanied by text descriptions. To reduce the annotation workload, or to directly use video data already available on the Internet, research on weakly supervised action localization has gradually emerged.

Because supervision is limited, weakly supervised action localization currently faces two basic challenges, action completeness modeling and action-context confusion, both of which degrade localization performance. To improve localization performance, this thesis analyzes the causes of these two challenges and designs networks based on intra-video and inter-video relation modules, respectively.

For action completeness modeling, we propose a method that models intra-video relations with graph convolution. The graph convolutional layer builds a bridge for information exchange between each segment in a video and its neighborhood, so that when a segment's feature is updated, other similar action segments are fully taken into account, which improves the completeness of action prediction.

For action-context confusion, we propose a method that models inter-video relations with a cross-attention mechanism. By sampling video pairs with the same label, different videos are embedded in the same feature space, and the corresponding features are updated through cross-attention. The objective is to make action features and action segment features match as closely as possible while keeping action features as far as possible from non-action context. These constraints make the model focus on the action itself rather than on context that is strongly correlated with it, which separates the action from its context and improves the accuracy of the model's localization predictions.

Finally, we combine the two proposed methods into a Relation Aware Weakly-Supervised temporal action localization network, which handles both challenges simultaneously and can be trained end to end. Experiments on three datasets, THUMOS14, ActivityNet1.2 and ActivityNet1.3, show through both accuracy and visualization results that the proposed network effectively alleviates the problems caused by these two challenges. Compared with the latest methods, the proposed method achieves better results.
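
The two relation modules described in the abstract can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the cosine-similarity top-k graph, the single-layer graph convolution, the scaled dot-product attention without learned projections, and all names (`cosine_adjacency`, `graph_conv`, `cross_attention`, `top_k`) are assumptions chosen only to show the general idea of propagating information between segments of one video and between segments of two videos that share a label.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cosine_adjacency(feats, top_k=3):
    """Row-normalized segment graph from cosine similarity.

    Assumption: each segment is linked to its top_k most similar
    segments (itself included); the thesis may build its graph differently.
    """
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T                               # (T, T) similarities
    idx = np.argsort(-sim, axis=1)[:, :top_k]         # top_k neighbors per row
    mask = np.zeros_like(sim)
    np.put_along_axis(mask, idx, 1.0, axis=1)
    adj = np.maximum(sim, 0.0) * mask                 # drop negative edges
    return adj / (adj.sum(axis=1, keepdims=True) + 1e-8)


def graph_conv(feats, adj, weight):
    """One graph convolution layer: aggregate neighbors, project, ReLU."""
    return np.maximum(adj @ feats @ weight, 0.0)


def cross_attention(query_feats, context_feats):
    """Update segments of one video from the segments of a paired video.

    Assumption: plain scaled dot-product attention with no learned
    query/key/value projections.
    """
    d = query_feats.shape[1]
    scores = query_feats @ context_feats.T / np.sqrt(d)   # (Tq, Tc)
    attn = softmax(scores, axis=1)
    return attn @ context_feats, attn


# Toy data: two videos with the same (hypothetical) label,
# 6 and 5 segments respectively, 8-dimensional segment features.
rng = np.random.default_rng(0)
video_a = rng.normal(size=(6, 8))
video_b = rng.normal(size=(5, 8))
weight = rng.normal(size=(8, 8)) * 0.1

adj = cosine_adjacency(video_a, top_k=3)
refined_a = graph_conv(video_a, adj, weight)            # intra-video relation
attended_a, attn = cross_attention(video_a, video_b)    # inter-video relation
print(refined_a.shape, attended_a.shape, attn.shape)
```

In this sketch the intra-video step lets each segment borrow evidence from similar segments of the same video, while the inter-video step re-expresses each segment of one video in terms of the segments of its paired video. A training objective that pulls action features together and pushes context away, as the abstract describes, would be built on top of such features and is not shown here.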

Keywords/Search Tags:

Temporal Action Localization, Weakly-Supervised Learning, Relation Modeling, Video Analysis

Related items

1	Research On Video-based Temporal Action Localization And Recognition
2	Deep Learning Based Temporal Action Localization
3	Temporal Action Localization In Massive Multimedia Video Scenario
4	Weakly Supervised Temporal Action Detection In Untrimmed Video
5	Research On Method Of Weakly Supervised Action Localization
6	A Research On Weakly Supervised Learning For Video Segmentation And Action Recognition
7	Research On Weakly Supervised Temporal Action Detection Algorithm
8	Research On Weakly Supervised Human Action Analysis Based On Deep Learning
9	Research On Temporal Action Location Method Combining Light And Heavy Networks In Untrimmed Video
10	Research On Video Temporal Action Localization Based On Deep Learning