Human action recognition in videos, an important research topic in computer vision, aims to recognize the actions of people in video scenes and determine their categories. Because video is three-dimensional data containing spatiotemporal information, it poses great challenges for feature extraction. Three-dimensional convolutional neural networks, a successful technique for human action recognition in videos, can model spatiotemporal information directly, which simplifies its extraction. However, three-dimensional convolutional models are difficult to optimize, have insufficient feature extraction capability, and extract features at a single scale, so they lack multi-scale features. To address these problems, this paper improves the C3D three-dimensional convolutional neural network through two main tasks.

(1) An improved C3D method based on a multi-scale feature extraction structure and a channel attention mechanism is proposed. Compared with traditional convolutional structures, the PPM (Pyramid Pooling Module) multi-scale pooling module extracts features at different scales and thus obtains richer information, while the channel attention mechanism emphasizes the important channels in the features. By replacing the convolutional structure in the original network with the multi-scale feature extraction module and introducing the channel attention mechanism to emphasize the extracted features, the network obtains richer and more effective features and therefore classifies video actions better. Experiments on the UCF-101 and HMDB-51 datasets show that the improved method achieves some improvement in accuracy and has certain performance advantages over the more classic networks currently in use.

(2) An improved C3D method combined with a hybrid attention mechanism is proposed. The channel attention mechanism compresses spatial information when emphasizing features, which causes spatial information to be lost. To address this problem, a spatial attention mechanism is introduced to improve the network. Building on the GCNet channel attention module, the 3D-Crisscross spatial attention module is added to construct a hybrid attention module. Both attention modules perform global context modeling, which establishes long-range dependencies in the three-dimensional features, strengthens the network's feature extraction along both the channel and spatial dimensions, and improves the model's modeling performance. Experiments on the UCF-101 and HMDB-51 large video datasets, with comparisons against other deep learning models, show that the proposed method achieves relatively higher accuracy than those models and a significant improvement over the original C3D method.

In summary, this paper proposes two improved methods based on the C3D three-dimensional convolutional neural network, both of which improve the network's recognition performance. The effectiveness of the proposed improvements is verified through theoretical analysis and experimental results.
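
The point made in (2), that channel attention compresses spatial information, can be illustrated with a minimal sketch. The code below is a generic squeeze-and-scale form of channel attention written in plain Python; it is not the GCNet module or the implementation used in this work. The function name `channel_attention`, the toy feature values, and the plain sigmoid gate (standing in for learned layers) are all assumptions made for illustration.

```python
import math


def channel_attention(features):
    """Generic channel-attention sketch (hypothetical, for illustration only).

    `features` is a list of channels; each channel is a flat list holding the
    activations of all T*H*W spatiotemporal positions.
    """
    # Squeeze: global average pooling reduces each channel to a single scalar,
    # so the position of an activation inside the channel is discarded here.
    descriptors = [sum(ch) / len(ch) for ch in features]
    # Gate: a plain sigmoid stands in for the learned layers of a real module.
    weights = [1.0 / (1.0 + math.exp(-d)) for d in descriptors]
    # Scale: every position in a channel is multiplied by the same weight.
    scaled = [[w * v for v in ch] for w, ch in zip(weights, features)]
    return weights, scaled


# Toy input: channels 0 and 1 carry the same activation at different positions.
features = [
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 4.0],
    [-1.0, -1.0, -1.0, -1.0],
]
weights, scaled = channel_attention(features)
```

In this toy input, channels 0 and 1 receive identical weights even though their activations sit at different positions, because the pooled descriptor carries no positional information. A spatial attention branch is one way to reintroduce that information, which is the motivation given above for the hybrid module.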