The task of temporal action detection is to predict the boundaries and the category of each action instance in untrimmed videos. With the rapid development of mobile devices and the internet, the number of videos has grown rapidly, and temporal action detection is widely used in fields such as video recommendation, intelligent surveillance, and human-computer interaction. However, due to the diversity of actions, the complexity of backgrounds, and the ambiguity of action boundaries, accurate temporal action detection remains an urgent problem, so studying efficient and accurate temporal action detection is of great significance. The Transformer model, based on the self-attention mechanism, has demonstrated impressive results in image classification, object detection, and video understanding, and inspired by this success, Transformer networks have received increasing attention in temporal action detection. However, the conventional Transformer self-attention mechanism focuses on correlations between individual features while ignoring correlations between action segments of different sizes, relations among different segments, context retrieval, and the suppression of background influence. To address these problems, this paper proposes a temporal action detection method based on a differential and fully convolutional attention network. The specific work is as follows:

(1) To retrieve the components relevant to an action segment across the entire video sequence, this paper proposes a new multi-scale fully convolutional attention mechanism. The action segment is used as a convolution kernel, a convolution operation computes its correlation with the video sequence as attention, and the attention results at multiple scales are fused to obtain temporal action features. This design effectively models the correlations between action segments and improves localization accuracy.

(2) To eliminate the influence of complex backgrounds on detection performance, this paper proposes a background-constrained action-background difference module. The module uses a differential attention mechanism to enhance the discriminability between action and background features, and it is combined with the fully convolutional attention network to form an end-to-end temporal action detection network.

To verify the effectiveness of the method, experiments were conducted on the standard THUMOS14, ActivityNet1.3, EPIC-Kitchens 100, and Ego4D datasets, on which the proposed method reaches average mAPs of 64.5%, 38.3%, 26.0%, and 18.6%, respectively. In comparative experiments against mainstream methods of recent years on the THUMOS14 dataset, the average mAP of the proposed method exceeds that of the BMN, TCANet, AFSD, and ActionFormer algorithms by 26%, 20.1%, 12.5%, and 1.9%, respectively; on the ActivityNet1.3 dataset, it exceeds the same four algorithms by 4.4%, 2.8%, 3.9%, and 2.7%, respectively. The experimental results demonstrate the effectiveness of the proposed method.
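The segment-as-kernel idea in contribution (1) can be illustrated with a toy NumPy sketch. The abstract gives no implementation details, so everything here is an assumption: the function names, the choice of softmax normalization, averaging as the multi-scale fusion, and the use of window means as values are all hypothetical simplifications of the described mechanism (action segment used as a convolution kernel, correlation response used as attention, results fused over several scales).

```python
import numpy as np

def conv_attention(x, seg_start, seg_len):
    """Use an action segment of x as a 1-D convolution kernel (hypothetical
    simplification): slide it over the sequence x (T, C), and treat the
    correlation response at each position as an attention score."""
    T, C = x.shape
    kernel = x[seg_start:seg_start + seg_len]            # segment as kernel
    n_pos = T - seg_len + 1
    scores = np.empty(n_pos)
    for t in range(n_pos):                               # "valid" convolution
        scores[t] = np.sum(kernel * x[t:t + seg_len])    # correlation response
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                                   # softmax over positions
    # aggregate window features under the attention weights
    windows = np.stack([x[t:t + seg_len].mean(0) for t in range(n_pos)])
    return attn @ windows                                # (C,) attended feature

def multi_scale_attention(x, seg_start, scales=(2, 4, 8)):
    """Fuse conv-attention results from several segment lengths (fusion by
    simple averaging is an assumption; the paper's fusion may differ)."""
    feats = [conv_attention(x, seg_start, s)
             for s in scales if seg_start + s <= len(x)]
    return np.mean(feats, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 16))     # toy video feature sequence (T=32, C=16)
fused = multi_scale_attention(x, seg_start=5)
print(fused.shape)                    # (16,)
```

The key point the sketch captures is that attention is computed by convolving a candidate segment against the whole sequence, so positions resembling the segment (at any of several temporal scales) receive high weight.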
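The differential attention in contribution (2) can likewise be sketched in NumPy. The abstract does not specify the formulation, so this is a minimal hypothetical reading: attention mass directed at positions labelled as background is subtracted (scaled by a factor `lam`) and the weights renormalized, sharpening the contrast between action and background features. All names and the background-mask interface are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def differential_attention(q, k, v, bg_mask, lam=0.5):
    """Hypothetical action-background differential attention: compute standard
    scaled dot-product attention, then down-weight attention assigned to
    background positions (bg_mask == 1) before renormalizing."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (Tq, Tk) similarities
    attn = softmax(scores)
    bg_attn = attn * bg_mask                       # mass on background positions
    diff = np.clip(attn - lam * bg_attn, 0.0, None)
    diff /= diff.sum(-1, keepdims=True) + 1e-8     # renormalize rows
    return diff @ v                                # background-suppressed output

rng = np.random.default_rng(1)
q = rng.standard_normal((4, 8))                    # toy queries
k = rng.standard_normal((10, 8))                   # toy keys
v = rng.standard_normal((10, 8))                   # toy values
bg = np.zeros(10)
bg[:3] = 1.0                                       # first 3 positions = background
out = differential_attention(q, k, v, bg)
print(out.shape)                                   # (4, 8)
```

With `lam > 0` the output is pulled toward value vectors at action positions, which is the discriminability effect the module is described as providing; the end-to-end network would learn the background constraint rather than take a fixed mask as here.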