Relation Aware Network For Weakly-Supervised Temporal Action Localization

Posted on: 2022-05-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y K Zhan
Full Text: PDF
GTID: 2518306323478344
Subject: Cyberspace security
Abstract/Summary:
In recent years, temporal action localization in video has become an important and challenging research direction because of its wide practical application, and it has received increasing attention. However, fully-supervised action localization requires a great deal of manual effort to obtain fine frame-level or segment-level annotations on long untrimmed videos, which severely limits its practical use. In addition, with the boom of short videos on the Internet, users upload large amounts of video data accompanied by text descriptions. To reduce the annotation workload, or to use the video data available on the Internet directly, research on weakly-supervised action localization is gradually emerging.

Because supervision information is lacking, weakly-supervised action localization currently faces two basic challenges, namely action completeness modeling and action-context confusion, both of which lead to poor localization performance. To improve localization performance, in this paper we analyze the causes of these two challenges and design networks based on intra-video and inter-video relation modules, respectively.

For the challenge of action completeness modeling, we propose a method for modeling intra-video relations based on graph convolution. The graph convolutional layer builds a bridge for information exchange between each segment in the video and its neighborhood, so that when the feature of a segment is updated, other similar action segments are fully taken into account, which improves the completeness of action prediction.

For the challenge of action-context confusion, we propose a method for modeling inter-video relations based on a cross-attention mechanism. By sampling pairs of videos with the same label, different videos are embedded in the same feature space, and the corresponding features are updated through the cross-attention mechanism. The features of action segments are made to match each other as closely as possible, while action features and non-action context features are kept as far apart as possible. These constraints make the model pay more attention to the action itself rather than to context that is strongly correlated with it, which separates the action from the context and improves the accuracy of the model's action localization predictions.

Finally, we combine the two proposed methods into a Relation Aware weakly-supervised temporal action localization network. The network handles both challenges simultaneously and can be trained end to end. In experiments on three datasets, THUMOS14, ActivityNet1.2 and ActivityNet1.3, both the accuracy and the visualization results indicate that the proposed network effectively alleviates the problems caused by the two challenges. Compared with the latest methods, the method proposed in this paper achieves better results.
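The two relation modules described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the thesis's implementation: the function names, the cosine-similarity graph, the `threshold` value and the scaled dot-product scoring are all assumptions made for illustration, and the learned weight matrices, projections and nonlinearities of a real graph convolution or attention layer are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Sum of vectors scaled by their weights."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def intra_video_graph_step(segments, threshold=0.5):
    """Assumed intra-video relation step: each segment feature becomes a
    row-normalized average over the segments similar to it.

    Edges link segments whose cosine similarity is at least `threshold`
    (a hypothetical choice); no learned weights are applied.
    """
    updated = []
    for u in segments:
        sims = [cosine(u, v) for v in segments]
        edges = [s if s >= threshold else 0.0 for s in sims]
        total = sum(edges)
        if total == 0:  # isolated segment (e.g. all zeros): leave unchanged
            updated.append(list(u))
        else:
            updated.append(weighted_sum([e / total for e in edges], segments))
    return updated


def inter_video_cross_attention(video_a, video_b):
    """Assumed inter-video relation step: each segment of video_a attends to
    all segments of video_b (a video sampled with the same label) using
    scaled dot-product scores, without learned query/key/value projections.
    """
    scale = math.sqrt(len(video_a[0]))
    attended = []
    for q in video_a:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in video_b]
        attended.append(weighted_sum(softmax(scores), video_b))
    return attended
```

In this sketch, `intra_video_graph_step` lets similar segments of one video share information, which is the intuition behind the completeness argument, while `inter_video_cross_attention` re-expresses the segments of one video in terms of a same-label video, which is the intuition behind separating action from context. The training losses that pull action features together and push context away are not shown.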
Keywords/Search Tags:Temporal Action Localization, Weakly-Supervised Learning, Relation Modeling, Video Analysis