| To meet the growing demand for video analysis and related applications, deep-learning-based temporal action detection has attracted increasing attention. Building on the theoretical framework of object detection, this thesis focuses on describing the spatiotemporal characteristics of human activities, modeling scene and context information, and constructing network frameworks for human action detection. Overall, the thesis studies action detection from the following three perspectives. 1. Cascaded boundary-sensitive temporal proposals and multi-task learning for action detection. This method follows a two-stage framework that first generates temporal proposals and then classifies them. In the proposal generation stage, the method first evaluates whether each sampled frame contains an action and whether it is the start or end boundary of an action instance, yielding three score sequences. It then constructs a preliminary proposal set from these three score sequences according to certain rules, and finally filters the proposals generated in the previous step. In the proposal classification stage, the feature sequence of the entire video is used to extract features for each proposal, and an overlap loss and a softmax loss are then combined for multi-task modeling. 2. A network based on 3D convolution and temporal region proposals for action detection. This method extends the 2D object detection framework (Faster R-CNN) to 3D temporal action detection (R-C3D) for end-to-end training. A 3D convolutional hierarchical network with shared weights extracts task-related features from the entire video; a temporal proposal subnetwork generates temporal proposals; and an action classification subnetwork classifies and refines the generated proposals. The network is trained by jointly optimizing the classification and regression tasks of the two subnetworks. 3. A multi-scale context-aware network for action detection. This method addresses the limitations of R-C3D in handling the context information of temporal actions. To find suitable context for temporal actions of different scales, context features with different resolutions and context scales are constructed. A two-branch structure is then built around the idea of “candidate-control”: one branch generates contextual feature candidates, and the other controls how much of each candidate passes through by means of a gate function (“sensitivity”). Extensive ablation experiments investigate the influencing factors: the effect of different contextual regions, their scales, and the choice of gate function. The results show that the proposed method effectively integrates multi-scale contextual information and further improves the detection performance of the overall algorithm (an improvement of more than 10% on the THUMOS'14 dataset). |
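
The abstract does not spell out the rules used to build the preliminary proposal set from the three score sequences in method 1. As a rough illustration only, the sketch below assumes a simple rule: take local peaks of the start and end scores as candidate boundaries, pair every start with every later end, and score each pair by the product of its boundary scores and the mean actionness inside it. The function names, the thresholds, and the confidence formula are all hypothetical and are not taken from the thesis.

```python
from itertools import product


def local_peaks(scores, threshold=0.5):
    """Return indices whose score is at least `threshold` and not lower
    than either neighbour (assumed rule for picking boundary candidates)."""
    peaks = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        if s >= threshold and s >= left and s >= right:
            peaks.append(i)
    return peaks


def generate_proposals(actionness, start, end,
                       threshold=0.5, max_len=None, min_conf=0.0):
    """Build (start_idx, end_idx, confidence) proposals from three
    per-frame score sequences of equal length.

    Assumed confidence: start score * end score * mean actionness
    over the frames from start_idx to end_idx inclusive.
    """
    starts = local_peaks(start, threshold)
    ends = local_peaks(end, threshold)
    proposals = []
    for s, e in product(starts, ends):
        if e <= s:
            continue  # an end boundary must come after its start
        if max_len is not None and e - s > max_len:
            continue  # optional cap on proposal duration
        inside = actionness[s:e + 1]
        conf = start[s] * end[e] * sum(inside) / len(inside)
        if conf >= min_conf:  # assumed stand-in for the filtering step
            proposals.append((s, e, conf))
    # Highest-confidence proposals first.
    proposals.sort(key=lambda p: p[2], reverse=True)
    return proposals


if __name__ == "__main__":
    actionness = [0.1, 0.2, 0.9, 0.9, 0.8, 0.2, 0.1, 0.1]
    start = [0.1, 0.3, 0.9, 0.2, 0.1, 0.1, 0.0, 0.0]
    end = [0.0, 0.0, 0.1, 0.2, 0.8, 0.3, 0.1, 0.0]
    for s, e, conf in generate_proposals(actionness, start, end):
        print(f"proposal frames {s}..{e}, confidence {conf:.3f}")
```

In a full pipeline the `min_conf` cut would likely be replaced by a learned evaluation of each proposal or by non-maximum suppression over overlapping proposals; the simple threshold here only marks where the filtering step would sit.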