Temporal action detection is one of the current research hotspots in computer vision. The task takes naturally captured video as input and outputs the start and end times of each specific action segment (temporal action proposal generation) together with the specific category of that action (action recognition). This paper studies the temporal action proposal generation task and the temporal action detection task in turn.

For the temporal action proposal generation task, existing methods struggle to locate action start and end boundaries accurately. To address this, this paper proposes a temporal action proposal generation network based on Boundary Prediction-Precise (BP-P). First, BP-P fuses local-level and proposal-level action features of the video sequence to make fuller use of the feature changes at the demarcation points between background and action, which improves the accuracy of action boundary localization. Second, BP-P proposes a new loss function, Free-Focal Loss, to address the imbalance between positive and negative samples and between hard and easy samples during training; Free-Focal Loss effectively balances the contributions of samples in different IoU intervals when the network weight gradients are updated. Finally, because the large gradients of hard samples are detrimental to the joint training of the classification and regression tasks, Balanced L1 Loss is introduced to promote the regression gradients of accurately localized samples. To demonstrate the effectiveness of the BP-P model on the temporal action proposal generation task, experiments are conducted on the publicly available ActivityNet-1.3 dataset. The results show that BP-P raises the AR@100 metric from 75.01% to 76.56%, which is comparable to the current best performance on this dataset (76.75%).

For the temporal action detection task, current one-stage frameworks offer high efficiency, while two-stage frameworks achieve high precision. To inherit the advantages of both while avoiding their disadvantages, this paper introduces, for the first time, the idea of fusing one-stage and two-stage frameworks from the RefineDet object detection algorithm into temporal action detection, and proposes the 3D RefineDet temporal action detection algorithm. The algorithm constructs a 3D detection network applicable to video features by temporally generalizing RefineDet's 2D modules. To demonstrate the effectiveness of 3D RefineDet on the temporal action detection task, experiments are conducted on the publicly available THUMOS-14 dataset. The results show that 3D RefineDet achieves significant improvements in the mAP@tIoU metric at multiple IoU thresholds, raising mAP from 50.1% to 53.6% at an IoU threshold of 0.3, an improvement of 3.5 percentage points.
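The exact form of the proposed Free-Focal Loss is not given here. For context only, the sketch below shows the standard alpha-balanced Focal Loss (Lin et al., 2017) on which such sample-rebalancing losses build: easy samples are down-weighted so that hard samples contribute more to the gradient. The function name and the alpha/gamma defaults are illustrative assumptions, not the thesis's formulation.

```python
import torch

def focal_loss(pred_prob, target, alpha=0.25, gamma=2.0, eps=1e-8):
    """Standard alpha-balanced Focal Loss (not the thesis's Free-Focal Loss).
    `pred_prob` holds probabilities in (0, 1); `target` holds binary labels.
    The (1 - p_t)^gamma factor down-weights easy samples so hard samples
    dominate the gradient update."""
    pt = pred_prob * target + (1 - pred_prob) * (1 - target)   # p_t
    at = alpha * target + (1 - alpha) * (1 - target)           # alpha_t
    return (-at * (1 - pt).pow(gamma) * (pt + eps).log()).mean()
```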
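Balanced L1 Loss is adopted from the object detection literature (Libra R-CNN, Pang et al., 2019). A minimal sketch of that published formulation follows, assuming the standard defaults alpha = 0.5, gamma = 1.5, beta = 1.0; whether the thesis changes these hyperparameters is not stated in the abstract.

```python
import math
import torch

def balanced_l1_loss(pred, target, alpha=0.5, gamma=1.5, beta=1.0):
    """Balanced L1 Loss as published in Libra R-CNN. Relative to Smooth L1,
    it raises the gradient contribution of accurately regressed (inlier)
    samples while keeping the gradient of outliers clipped at gamma."""
    diff = torch.abs(pred - target)
    # b is fixed by the continuity constraint alpha * ln(b + 1) = gamma
    b = math.exp(gamma / alpha) - 1
    loss = torch.where(
        diff < beta,
        alpha / b * (b * diff + 1) * torch.log(b * diff / beta + 1) - alpha * diff,
        gamma * diff + gamma / b - alpha * beta,
    )
    return loss.mean()
```

This behavior matches the stated motivation: the regression gradients of accurate (small-error) samples are promoted, while hard outliers cannot dominate the joint classification-regression training.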
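The abstract describes 3D RefineDet only as a temporal generalization of RefineDet's 2D modules. The sketch below merely illustrates that kind of generalization, replacing a 2D spatial convolution with a 3D spatio-temporal one; the block structure, channel sizes, and tensor shapes are hypothetical and not taken from the thesis.

```python
import torch
import torch.nn as nn

# A 2D conv block operating on image feature maps (N, C, H, W) ...
block_2d = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)

# ... generalized to a 3D conv block operating on video feature volumes (N, C, T, H, W),
# so the kernel also convolves over the temporal axis T.
block_3d = nn.Sequential(
    nn.Conv3d(256, 256, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)

video_features = torch.randn(1, 256, 8, 14, 14)   # (N, C, T, H, W), shapes are illustrative
print(block_3d(video_features).shape)             # torch.Size([1, 256, 8, 14, 14])
```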