Video action recognition and detection are representative tasks in video understanding, with broad application prospects in daily life. In recent years, deep learning has driven new progress on this topic, but the complexity of video data and the inherent uncertainty of human action still make it difficult to build efficient recognition and detection models. This paper studies deep-learning-based action recognition and detection methods in depth and improves on the shortcomings of existing methods. The main work of this paper is as follows.

First, we propose an improved action recognition method based on TSN. To address the lack of association between the sampling groups of the TSN network, a forget-gate connection module is designed: the forget-gate structure of LSTM is used to establish feature-level connections between groups and to integrate them along the time dimension, which strengthens information transmission between groups and improves connectivity in the temporal dimension. We also improve the fusion of the spatial and temporal streams by using ConvLSTM to connect the feature extraction networks: the output features of the two-stream network are stacked, and ConvLSTM then learns long-term spatiotemporal dependencies over these features, which avoids the drawback of previous dual-stream + recurrent neural network fusion methods that destroy spatial structure. The improved model is evaluated on the UCF101 and HMDB51 datasets. The results show a significant improvement over the original TSN algorithm and a recognition accuracy comparable to state-of-the-art methods.

Second, we propose a spatiotemporal action detection model based on fused non-local blocks. The model uses a two-branch convolutional neural network to analyze the spatial and temporal information of video: the spatial network takes a single video frame as input to extract appearance features of the current frame, while the spatiotemporal network takes a sequence of video frames as input to extract spatiotemporal features. To address the limited ability of convolutional neural networks to model temporal information, the spatiotemporal branch uses a three-dimensional convolutional network with integrated non-local blocks to capture global relations between video frames. To further enhance contextual semantic information, a channel fusion mechanism aggregates the features of the two branches, and the fused features are used for frame-level detection. The model is validated on the UCF101-24 and JHMDB datasets, and the results show that it can fully integrate spatial and temporal information and achieves high detection accuracy on video-based spatiotemporal action detection tasks.
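
To make the two-stream ConvLSTM fusion described above concrete, the following is a minimal PyTorch-style sketch, not the thesis's actual implementation. It assumes per-segment spatial and temporal feature maps of shape (B, T, C, H, W); the class names (`ConvLSTMCell`, `TwoStreamConvLSTMFusion`), channel sizes, and the final global-average-pool classifier are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """A single ConvLSTM cell: LSTM gates computed with 2-D convolutions,
    so the hidden state keeps its spatial layout (sketch, not the paper's code)."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # One convolution produces all four gates (input, forget, output, candidate).
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)
        self.hidden_channels = hidden_channels

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c


class TwoStreamConvLSTMFusion(nn.Module):
    """Stack per-segment spatial and temporal feature maps along channels, then run
    a ConvLSTM over the segment axis to model long-term spatiotemporal dependence."""
    def __init__(self, feat_channels, hidden_channels, num_classes):
        super().__init__()
        self.cell = ConvLSTMCell(2 * feat_channels, hidden_channels)
        self.classifier = nn.Linear(hidden_channels, num_classes)

    def forward(self, spatial_feats, temporal_feats):
        # spatial_feats, temporal_feats: (B, T, C, H, W) features per sampling group.
        b, t, c, hgt, wid = spatial_feats.shape
        h = spatial_feats.new_zeros(b, self.cell.hidden_channels, hgt, wid)
        cstate = torch.zeros_like(h)
        for step in range(t):
            x = torch.cat([spatial_feats[:, step], temporal_feats[:, step]], dim=1)
            h, cstate = self.cell(x, (h, cstate))
        # Global-average-pool the final hidden state and classify.
        return self.classifier(h.mean(dim=(2, 3)))
```

Because the fusion operates on feature maps rather than pooled vectors, the spatial layout of the two-stream features is preserved while the recurrence models the temporal dimension, which is the stated motivation for using ConvLSTM here.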
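
Similarly, the sketch below illustrates the two building blocks of the second contribution: an embedded-Gaussian non-local block over 3-D features (in the style of Wang et al.'s non-local neural networks) and a simple channel-wise fusion of the two branches. The module names, the halved intermediate channel count, and the 1x1-convolution fusion are assumptions for illustration; the thesis's exact design may differ.

```python
import torch
import torch.nn as nn


class NonLocalBlock3D(nn.Module):
    """Embedded-Gaussian non-local block over (T, H, W) feature maps:
    every position attends to every other position, capturing global
    relations between video frames (illustrative sketch)."""
    def __init__(self, channels):
        super().__init__()
        inter = max(channels // 2, 1)
        self.theta = nn.Conv3d(channels, inter, 1)
        self.phi = nn.Conv3d(channels, inter, 1)
        self.g = nn.Conv3d(channels, inter, 1)
        self.out = nn.Conv3d(inter, channels, 1)

    def forward(self, x):
        b, c, t, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, THW, C')
        k = self.phi(x).flatten(2)                     # (B, C', THW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, THW, C')
        attn = torch.softmax(q @ k, dim=-1)            # pairwise position relations
        y = (attn @ v).transpose(1, 2).reshape(b, -1, t, h, w)
        return x + self.out(y)                         # residual connection


class ChannelFusion(nn.Module):
    """Aggregate spatial-branch and spatiotemporal-branch features by
    concatenating along channels and mixing with a 1x1 convolution,
    before frame-level detection (assumed fusion form)."""
    def __init__(self, spatial_ch, spatiotemporal_ch, out_ch):
        super().__init__()
        self.mix = nn.Conv2d(spatial_ch + spatiotemporal_ch, out_ch, 1)

    def forward(self, spatial_feat, spatiotemporal_feat):
        # Both inputs: (B, C, H, W); the spatiotemporal branch is assumed
        # to have been collapsed over time before fusion.
        fused = torch.cat([spatial_feat, spatiotemporal_feat], dim=1)
        return torch.relu(self.mix(fused))
```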