Convolutional neural networks have emerged as the most widely used deep learning technology for action recognition. However, convolutional neural network-based action recognition models have certain limitations, such as insufficient feature extraction capability and weak representation of critical features. This research proposes three action recognition models, addressing feature extraction, feature representation, and the exploitation of local and global features.

First, to overcome the inference latency of two-stream 3D convolutional networks, this research proposes an action recognition model that integrates channel attention and knowledge distillation. The model introduces channel attention into 3D ResNeXt101, takes continuous video frames as input, and, following a knowledge distillation strategy, builds a weighted linear combination loss function to train both a teacher model and a student model; the teacher guides the student to learn important information, improving the student's recognition accuracy. At inference time the student model does not use optical flow, which reduces the amount of computation. Experiments on the UCF101 and HMDB51 datasets show that the proposed model is feasible and effective.

Second, given the heavy computational load imposed by the two-stream design of 2D convolutional networks, this research presents an action recognition model based on spatiotemporal and motion features. The model samples input image sequences sparsely and inserts a time shift module, a spatiotemporal excitation module, and a motion excitation module into 2D ResNet50 to integrate the spatiotemporal and motion features of the encoded video. In addition, a spatiotemporal feature enhancement module strengthens the spatiotemporal information of the input video clips, yielding a more substantial recognition effect. The model performs well on the Something-Something V1 and Something-Something V2 datasets.

Third, to address the weakness of action recognition models in modeling local and global information in video, this research proposes an action recognition model that combines spatiotemporal multi-scale convolution with self-attention. The model uses sparse sampling, adds a spatiotemporal grouping multi-scale module to 2D ResNet50, and adaptively learns essential global video information via self-attention. It performs well on the Something-Something V1 and Diving48 video action recognition datasets.
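The weighted linear combination loss used for knowledge distillation in the first model can be sketched as follows. This is a minimal pure-Python illustration of the generic teacher-student distillation loss (hard-label cross-entropy plus temperature-scaled soft-target divergence); the weighting factor `alpha` and temperature are illustrative assumptions, not the values used in this research.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    # Hard-label loss against the ground-truth class index.
    probs = softmax(student_logits)
    return -math.log(probs[label])

def kl_divergence(p, q):
    # KL(p || q) between two discrete probability distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=4.0):
    # Weighted linear combination of the hard-label loss and the
    # soft-target loss; alpha and temperature are hypothetical values.
    hard = cross_entropy(student_logits, label)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # The T^2 factor keeps soft-target gradients comparable in scale,
    # as is standard in distillation formulations.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft
```

When the teacher and student agree exactly, the soft term vanishes and only the weighted hard-label loss remains, which is one quick sanity check on the combination.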
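The time shift module inserted into 2D ResNet50 in the second model exchanges information between neighboring frames at zero extra multiply-add cost. A minimal sketch of the standard shift operation, assuming list-of-lists features of shape [T][C] and an illustrative `shift_div` fraction (the exact fraction used in this research is not stated here):

```python
def temporal_shift(frames, shift_div=4):
    # frames: per-frame feature vectors, shape [T][C].
    # The first C//shift_div channels are shifted backward in time,
    # the next C//shift_div forward; the rest stay in place.
    # Out-of-range positions are zero-padded, as in the usual
    # temporal-shift formulation.
    T = len(frames)
    C = len(frames[0])
    fold = C // shift_div
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            if c < fold:          # take this channel from the next frame
                src = t + 1
            elif c < 2 * fold:    # take this channel from the previous frame
                src = t - 1
            else:                 # untouched channels
                src = t
            if 0 <= src < T:
                out[t][c] = frames[src][c]
    return out
```

Because the shift only re-indexes existing activations, the subsequent 2D convolutions see a mixture of past, present, and future frames, which is what lets a 2D backbone encode temporal structure.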