With the explosive growth in the number of new videos, video understanding plays an increasingly important role in computer vision. As one of the hot topics in video understanding, temporal action detection is key to many video analysis applications, such as action analysis, autonomous driving, surveillance anomaly detection, and video understanding. The goal of temporal action detection is to localize the start and end times of each action and identify its category in long, untrimmed videos. In recent years, action recognition methods have achieved good results, but the prediction of action start and end times remains imperfect. Temporal action detection faces flexible action boundaries, widely varying action durations, and long-range temporal interdependencies between actions. It is therefore important to capture the temporal dynamic semantic features of actions. However, most temporal action detection methods ignore the temporal semantic consistency between instances of the same action. To address these problems, this paper carries out research in the following aspects.

(1) This paper proposes a local-global temporal semantic dependency aware model for temporal action detection. The model captures the local contextual information of actions through temporal convolutional networks. At the same time, a self-attention mechanism models the dependencies between actions in long sequences, avoiding the limitation that temporal convolution only relates actions within its receptive field, which makes it sensitive to local noise and prone to generating incomplete candidate action proposals. Since supervised contrastive learning can better capture higher-level semantic features, this paper further uses the category information of video actions to learn such features through supervised contrastive learning, forcing action features of the same category to be as similar as possible and features of different categories to be as far apart as possible. This improves the completeness of detected actions and yields more accurate predictions of action boundaries.

(2) We propose a snippet-level supervised contrastive learning based Transformer model for temporal action detection. Considering the powerful performance of Transformers in temporal feature extraction, the model uses a Transformer-based global temporal encoding network to capture global temporal features, and applies snippet-level supervised contrastive learning to learn semantically consistent action features: snippet features of the same action category are forced to be as close as possible, and snippet features of different action categories as far apart as possible. This allows the model to localize complete action proposals. Extensive experiments are carried out on two benchmark datasets, ActivityNet-v1.3 and THUMOS14. The results show that the proposed models are effective and achieve significant improvements over the baseline models.
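The local-global idea described above, i.e. a temporal convolution for local context combined with self-attention for long-range dependencies, can be illustrated with a minimal numpy sketch. This is an assumption-laden illustration, not the paper's implementation: the function name `local_global_encode`, the depthwise convolution form, the single-head attention, and the additive fusion of the two branches are all choices made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_global_encode(snippets, conv_kernel, Wq, Wk, Wv):
    """Sketch of combining local temporal convolution with global self-attention.

    snippets:    (T, D) per-snippet features of an untrimmed video.
    conv_kernel: (K, D) depthwise kernel giving a local receptive field of K.
    Wq, Wk, Wv:  (D, D) query/key/value projections for self-attention.
    Returns (T, D): local and global branch outputs summed.
    """
    T, D = snippets.shape
    K = conv_kernel.shape[0]
    pad = K // 2
    padded = np.pad(snippets, ((pad, pad), (0, 0)))
    # local branch: depthwise temporal convolution, only sees K neighbours
    local = np.stack([(padded[t:t + K] * conv_kernel).sum(axis=0)
                      for t in range(T)])
    # global branch: self-attention relates every snippet to every other,
    # so dependencies are not limited to the convolutional receptive field
    q, k, v = snippets @ Wq, snippets @ Wk, snippets @ Wv
    attn = softmax(q @ k.T / np.sqrt(D), axis=-1)
    globl = attn @ v
    return local + globl
```

The additive fusion keeps the output shape `(T, D)`, so the encoder can be stacked or fed directly to a boundary-prediction head.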
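The snippet-level supervised contrastive objective, which pulls same-category snippet features together and pushes different-category features apart, can be sketched as follows. This follows the standard supervised contrastive loss formulation; the function name, temperature value, and numpy realization are assumptions for illustration, not the paper's code.

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of snippet features.

    features: (N, D) snippet feature array; labels: (N,) integer class labels.
    For each anchor, snippets with the same label are positives; all other
    snippets appear in the denominator, so different classes are pushed apart.
    """
    # L2-normalize so dot products are cosine similarities
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)          # exclude self-similarity
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    exp_sim = np.exp(sim) * not_self
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))
    # positives: same label, excluding the anchor itself
    pos_mask = (labels[:, None] == labels[None, :]) & not_self
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0                      # anchors with >=1 positive
    loss = -(pos_mask * log_prob).sum(axis=1)[valid] / pos_counts[valid]
    return loss.mean()
```

With well-separated classes the loss is near zero, and it grows when snippets of different categories are treated as positives, which is exactly the pressure that encourages semantically consistent features within one action instance.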