
Research Of Temporal Action Localization Algorithm Based On Weakly-Supervised Deep Learning

Posted on: 2022-08-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y. Y. Li
Full Text: PDF
GTID: 2558307154968549
Subject: Information and Communication Engineering
Abstract/Summary:
Temporal action localization aims to localize the temporal boundaries of action instances and identify their action categories in untrimmed long videos. It is widely used in fields such as autonomous driving, video surveillance, virtual reality, and video retrieval. Fully-supervised temporal action localization uses both action category labels and temporal boundary labels as supervision and requires a large amount of training data; in real-world settings, however, annotating the temporal boundaries of action instances consumes considerable manpower and material resources. Weakly-supervised temporal action localization, which requires only video-level action category labels, was developed to address this problem. This thesis therefore focuses on weakly-supervised temporal action localization; the specific work and research results are as follows:

We propose a multi-branch temporal action localization network, MTALNet, which contains a temporal fusion model and a multi-branch attention model. The temporal fusion model maps video features into a feature space suited to weakly-supervised temporal action localization by fusing the local and global temporal context of the videos. The multi-branch attention model separately models the distinguishable actions, distinguishable background, and ambiguous actions in the videos. Based on the multi-branch attention weights, three temporal class activation sequences are constructed to optimize the action classification loss, so that the network can separate distinguishable action features from distinguishable background features. Experimental results show that the proposed approach outperforms multiple state-of-the-art methods, achieving an average localization precision of 29.6% over different IoU thresholds on the THUMOS-14 dataset.

Most existing weakly-supervised temporal action localization algorithms aggregate distinguishable action features with high activation values to optimize the classification loss. As a result, the network tends to ignore the ambiguous actions in the videos that are difficult to classify, which makes it hard to guarantee the completeness of the localization results. To address this, we design an ambiguous action contrast loss function that refines ambiguous action features under the guidance of distinguishable features, so that the network can perceive precise temporal action boundaries and avoid interruptions within action intervals. Combined with the proposed loss function, MTALNet outperforms previous methods on three weakly-supervised temporal action localization benchmarks: THUMOS-14, ActivityNet-1.2, and ActivityNet-1.3, improving localization precision by 1.5%, 1.3%, and 1.2%, respectively. Visualization results show that the ambiguous action contrast loss effectively reduces the misclassification of ambiguous actions and enables the network to capture more complete action segments.
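The multi-branch attention scheme described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the thesis's actual implementation: the shapes, the three-branch ordering (action / background / ambiguous), and the top-k temporal pooling are all assumptions made for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 20, 5  # number of video snippets, number of action classes

# Snippet-level class activation sequence (logits), e.g. from a classifier head.
cas = rng.normal(size=(T, C))

# Three attention branches (assumed: action / background / ambiguous action),
# softmax-normalized across branches so the three weights sum to 1 per snippet.
branch_logits = rng.normal(size=(T, 3))
attn = np.exp(branch_logits) / np.exp(branch_logits).sum(axis=1, keepdims=True)

def video_score(weighted_cas, k=4):
    """Top-k temporal pooling of a weighted CAS into video-level class logits."""
    topk = np.sort(weighted_cas, axis=0)[-k:]  # k highest activations per class
    return topk.mean(axis=0)

# One attention-weighted CAS per branch; each is pooled into a video-level
# prediction that would feed a separate classification loss term.
scores = [video_score(attn[:, b:b + 1] * cas) for b in range(3)]
probs = [np.exp(s) / np.exp(s).sum() for s in scores]  # softmax over classes
```

With only video-level labels available, the classification loss on these pooled scores is what drives the attention branches apart, which is why the abstract emphasizes separating distinguishable action from distinguishable background features.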
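The ambiguous action contrast loss is described above only at a high level; one plausible reading is an InfoNCE-style contrastive term that pulls ambiguous-action features toward the distinguishable-action features and away from background features. The sketch below is a hypothetical formulation under that assumption (cosine similarity, temperature `tau`, cluster-mean prototypes), not the loss actually used in the thesis.

```python
import numpy as np

def l2norm(x):
    """Normalize vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrast_loss(ambiguous, action, background, tau=0.1):
    """InfoNCE-style contrast: for each ambiguous-action feature, treat the
    mean distinguishable-action feature as the positive and the mean
    background feature as the negative, using cosine similarity / tau."""
    amb = l2norm(ambiguous)                 # (N, D) ambiguous snippet features
    pos = l2norm(action.mean(axis=0))       # (D,) action prototype
    neg = l2norm(background.mean(axis=0))   # (D,) background prototype
    s_pos = np.exp(amb @ pos / tau)
    s_neg = np.exp(amb @ neg / tau)
    return float(-np.log(s_pos / (s_pos + s_neg)).mean())
```

Minimizing such a term makes ambiguous snippets resemble the action class they border, which matches the abstract's claim that refining ambiguous features yields more complete action segments without interval interruptions.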
Keywords/Search Tags: Weakly-supervised deep learning, Temporal action localization, Ambiguous action, Temporal class activation sequence, Attention mechanism