Recently,remarkable progress has been achieved in video action detection by using deep learning techniques.However,for action detection in real-world untrimmed videos,the accuracies of most existing approaches are still far from satisfactory,due to the difficulties in temporal action localization.To tackle this challenge,we propose a spatiotemporal,multi-task,3D deep convolutional neural network to detect(including temporally localize)actions in untrimmed videos.First,we introduce a fusion framework which aims to extract video-level spatiotemporal features in the training phase.And we demonstrate the effectiveness of video-level features by evaluating our model on video action recognition task.Then,under the fusion framework,we propose a spatiotemporal multi-task network,which has two sibling output layers for action classification and temporal localization,respectively.To obtain precise temporal locations,we present a novel temporal regression method to revise the proposal window which contains an action.Meanwhile,in order to better utilize the rich motion information in videos,we introduce a novel video representation,interlaced images,as an additional network input stream.As a result,our model outperforms state-of-the-art methods for both action recognition and detection on standard benchmarks. |