Video Contextual Information Excavation For Temporal Action Localization

Posted on: 2024-06-05, Degree: Master, Type: Thesis
Country: China, Candidate: X L Cui, Full Text: PDF
GTID: 2568307100964089, Subject: Computer technology
Abstract/Summary:
Temporal action localization (TAL) is a popular research task in computer vision that aims to detect the start and end times and the categories of action segments in untrimmed videos. Most current works do not fully exploit multi-temporal-scale information or video context information. This thesis therefore proposes temporal action localization algorithms based on video contextual information mining. The work consists of two parts.

(1) A Multitemporal Scale and Spatial-Temporal Transformer Network for Temporal Action Localization (MSST). The algorithm first uses a refined feature pyramid (RFP) to transfer semantics from high-level scales to low-level scales. Second, to model long-term dependencies across the entire video, it uses a spatial-temporal transformer (STT) encoder to capture the long-term correlation of video frames. The refined features with long-range dependencies are then fed to a classifier for coarse action prediction. Finally, to further improve prediction accuracy, a frame-level self-attention (FSA) module refines the classification and boundaries of each action instance. Most importantly, these three modules are explored jointly in a unified framework, and MSST has an anchor-free, end-to-end architecture. Extensive experiments show that the proposed method achieves 54.2% on the THUMOS14 dataset and 34.1% on the ActivityNet1.3 dataset.

(2) A Temporal Channel Enhancement and Contextual Excavation Network for Temporal Action Localization (TCN). Most previous methods apply the classifier and the locator to the same feature, so the classification and localization processes are relatively independent. Consequently, when the classification and localization results are fused, the classification may be correct while the localization is wrong, or vice versa, which makes the final result inaccurate. To solve this problem, a temporal channel enhancement and contextual excavation network (TCN) is proposed to generate robust classification and localization features and to refine the final localization results. Specifically, a temporal channel enhancement (TCE) module enhances the temporal and channel information of the feature sequence. Then, a temporal semantic contextual excavation (TSCE) module establishes relationships between frames, because frames at different positions provide potential information for action localization. Finally, a fusion of classification and localization (FCL) module obtains the final localization results by combining robust classification and localization features. This design demonstrates a significant improvement over previous work. Extensive experiments show that the proposed method achieves 68.6% on the THUMOS14 dataset, outperforming all state-of-the-art methods, and a comparable 36.8% on the ActivityNet1.3 dataset.
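
The idea of transferring semantics from high-level (coarse) temporal scales to low-level (fine) scales, as described for the refined feature pyramid in part (1), can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `top_down_refine`, the nearest-neighbour upsampling, and the plain element-wise addition are all assumptions chosen for illustration, and the abstract does not specify how RFP actually merges levels.

```python
import numpy as np


def top_down_refine(levels):
    """Hypothetical top-down pass over a temporal feature pyramid.

    levels: list of arrays shaped (T_i, C), ordered from the finest
            temporal scale (largest T_i) to the coarsest.
    Returns a list of arrays with the same shapes, where each finer
    level has received the features of the level above it.
    """
    refined = [None] * len(levels)
    # The coarsest level has nothing above it, so it is kept as is.
    refined[-1] = levels[-1].copy()
    for i in range(len(levels) - 2, -1, -1):
        t_fine = levels[i].shape[0]
        t_coarse = refined[i + 1].shape[0]
        # Nearest-neighbour upsampling along time: map every fine
        # time step to the coarse step that covers it.
        idx = (np.arange(t_fine) * t_coarse) // t_fine
        upsampled = refined[i + 1][idx]
        # Assumed merge rule: element-wise addition.
        refined[i] = levels[i] + upsampled
    return refined


# Toy pyramid: 8, 4, and 2 time steps with 4 channels each.
pyramid = [np.zeros((8, 4)), np.zeros((4, 4)), np.ones((2, 4))]
refined = top_down_refine(pyramid)
```

In this toy pyramid only the coarsest level carries non-zero features, so after the pass every finer level contains the values propagated down from it, which is the behaviour the sketch is meant to show.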
Keywords/Search Tags: temporal action localization, multi-temporal feature, temporal channel enhancement, contextual information excavation