Video action recognition is an active research topic in computer vision. With the rapid development of artificial intelligence in recent years, video action recognition has broad application value in areas such as intelligent video surveillance, medical monitoring, and autonomous driving. However, factors in real environments such as complex backgrounds, camera motion, and changes in human pose make the video action recognition task highly challenging. In this paper, we address current problems in action recognition through spatio-temporal feature modeling. The primary contributions are as follows:

(1) We first review video action recognition methods in two broad categories: methods based on hand-crafted feature extraction and methods based on deep learning. Hand-crafted methods are further subdivided into holistic features and local features, while deep learning methods are divided into approaches based on two-stream convolutional networks, 3D convolutional networks, recurrent neural networks, and Transformers. A comparative analysis of current video action recognition methods is presented for the reference of related researchers.

(2) The optical flow information in two-stream networks lacks the ability to capture long-range temporal relationships, and 3D convolutional networks have a large number of parameters and converge slowly. To learn more complete spatio-temporal features, this paper proposes multi-dimensional feature activation residual networks (MFARs). MFARs use 2D convolutional networks to learn temporal feature representations: a motion supplement excitation module models temporal features and excites motion information across temporal channels, while a united information excitation module uses temporal features to excite channel and spatial information, yielding better temporal feature representations. MFARs achieve accuracies of 96.5% on UCF101 and 73.6% on HMDB51. Comparison with current mainstream action recognition models shows that the proposed multi-dimensional feature excitation method effectively represents spatio-temporal features and achieves a better balance between complexity and classification accuracy.

(3) To address the computational complexity and large parameter counts of 3D convolutional networks and Transformer-based methods, this paper introduces a self-attention mechanism based on 2D convolution and designs a long-short temporal feature fusion network to model temporal features. Long-range and short-range temporal features are modeled by separate modules to suppress irrelevant information such as background and to focus on motion regions, thereby improving the accuracy of video action recognition. The effectiveness of the model is verified on two datasets, UCF101 and Something-Something V1, and ablation experiments show that the network improves classification accuracy on action recognition tasks.
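To make the motion excitation idea in contribution (2) concrete, the following is a minimal sketch of a temporal-difference-based excitation block built on 2D convolutions. It is an illustrative assumption of how such a module might be structured (channel reduction ratio, convolution shapes, and the residual gating are our choices), not the thesis's exact design.

```python
import torch
import torch.nn as nn


class MotionExcitation(nn.Module):
    """Sketch of a temporal-difference motion excitation block.

    Channels are gated by the squeezed difference between adjacent
    frames, so channels that carry motion information are amplified.
    All hyperparameters here are illustrative assumptions.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.squeeze = nn.Conv2d(channels, mid, kernel_size=1)
        self.transform = nn.Conv2d(mid, mid, kernel_size=3, padding=1)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        feat = self.squeeze(x.reshape(b * t, c, h, w)).reshape(b, t, -1, h, w)
        # difference between consecutive frames approximates motion
        diff = feat[:, 1:] - feat[:, :-1]
        diff = self.transform(diff.reshape(-1, diff.shape[2], h, w))
        diff = diff.reshape(b, t - 1, -1, h, w)
        # zero-pad the last step so the temporal length matches the input
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)
        # squeeze spatially, expand back to all channels, and gate
        gate = torch.sigmoid(self.expand(self.pool(diff.reshape(b * t, -1, h, w))))
        return x + x * gate.reshape(b, t, c, 1, 1)
```

The residual form `x + x * gate` keeps the original features intact while boosting motion-relevant channels, which matches the "supplement" framing in the text.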
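The long-short temporal modelling in contribution (3) can likewise be sketched on top of per-frame 2D-CNN features: short-range motion via a small temporal convolution, long-range dependencies via self-attention across frames. Module names, layer sizes, and the additive fusion below are illustrative assumptions, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn


class LongShortTemporalFusion(nn.Module):
    """Sketch of long/short temporal modelling over frame features.

    Short-range motion is captured with a depthwise temporal
    convolution; long-range dependencies with self-attention across
    frames; the two branches are fused by addition.  All design
    details are illustrative assumptions.
    """

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        # depthwise 1D conv over the time axis (short-range branch)
        self.short = nn.Conv1d(channels, channels, kernel_size=3,
                               padding=1, groups=channels)
        # self-attention across frames (long-range branch)
        self.long = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) -- frame features after spatial pooling
        short = self.short(x.transpose(1, 2)).transpose(1, 2)
        q = self.norm(x)
        long, _ = self.long(q, q, q)
        return x + short + long
```

Keeping both branches as residuals over 2D-CNN features avoids the parameter and compute cost of full 3D convolutions or video Transformers, which is the trade-off the abstract emphasises.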