With the rapid development of multimedia and Internet communication technology in recent years, and the rise of short-video and live-streaming platforms, video has permeated everyday life. To serve users' needs more accurately and efficiently, research on video understanding and feature extraction has flourished. Video carries the most complex information of all multimedia data: a video can be regarded as an ordered sequence of images, so it has not only the spatial dimension of images but also a temporal dimension. This thesis conducts in-depth research on the extraction of spatio-temporal video features. Guided by two different design concepts, we pursue two research directions: a mutually reinforced spatio-temporal convolutional tube and a multi-scale spatial-temporal integration convolutional tube. The main contributions are as follows:

(1) Mutually Reinforced Spatio-Temporal Convolutional Tube. We propose a Mutually Reinforced Spatio-Temporal convolutional Tube (MRST) for robust and accurate human action recognition. The MRST consists of a spatio-temporal decomposition unit, a mutually reinforced unit, and a spatio-temporal fusion unit. The decomposition unit extracts spatial and temporal features with a 2D convolution and a 1D convolution, respectively. The mutually reinforced unit learns the correlation between spatial and temporal information and uses this correlation to strengthen the discriminative capability of both the spatial and the temporal features. The fusion unit selectively emphasizes informative spatial and temporal features and fuses them into effective spatio-temporal feature maps. Experimental results show that the proposed deep MRST-Net achieves state-of-the-art performance on multiple action recognition datasets.

(2) Multi-Scale Spatial-Temporal Integration Convolutional Tube. We propose a
novel and efficient Multi-Scale Spatial-Temporal Integration (MSTI) convolutional tube that achieves accurate action recognition at a lower computational cost. The MSTI tube consists of a multi-scale convolution block and cross-scale attention weighted blocks. The multi-scale convolution block divides the input tensor into several groups, each with its own convolutional filters. The cross-scale attention weighted blocks integrate multi-scale spatial and temporal features, selectively emphasizing informative spatio-temporal features while suppressing less useful ones. Benefiting from these two blocks, our MSTI-Net requires significantly fewer computational resources while achieving state-of-the-art action recognition accuracy.
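To make the decomposition idea concrete, the sketch below factorizes a spatio-temporal convolution into a per-frame 2D spatial convolution followed by a per-pixel 1D temporal convolution, in the spirit of the MRST decomposition unit. This is an illustrative NumPy toy, not the thesis's implementation; the function names, kernel sizes, and single-channel setting are assumptions.

```python
import numpy as np

def spatial_conv2d(clip, kernel):
    """Apply one k x k spatial filter to every frame ('valid' padding)."""
    T, H, W = clip.shape
    k = kernel.shape[0]
    out = np.zeros((T, H - k + 1, W - k + 1))
    for t in range(T):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[t, i, j] = np.sum(clip[t, i:i + k, j:j + k] * kernel)
    return out

def temporal_conv1d(clip, kernel):
    """Apply one length-t temporal filter at every spatial location."""
    T, H, W = clip.shape
    t = kernel.shape[0]
    out = np.zeros((T - t + 1, H, W))
    for n in range(T - t + 1):
        # Weighted sum over a window of t consecutive frames.
        out[n] = np.tensordot(kernel, clip[n:n + t], axes=([0], [0]))
    return out

clip = np.random.rand(8, 16, 16)                       # T x H x W video clip
spatial = spatial_conv2d(clip, np.ones((3, 3)) / 9.0)  # 2D conv: per frame
fused = temporal_conv1d(spatial, np.ones(3) / 3.0)     # 1D conv: across frames
print(fused.shape)  # (6, 14, 14)

# One motivation for factorization: fewer parameters per filter than a
# full 3x3x3 3D kernel (27 weights vs 9 + 3 = 12).
print(3 * 3 * 3, "vs", 3 * 3 + 3)
```

The separate spatial and temporal outputs are what a mutually reinforced unit could then cross-correlate before fusion.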
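The multi-scale block's channel-splitting idea can likewise be sketched in a few lines: the channels are split into groups, each group is filtered at its own scale, and a softmax over the groups' mean responses re-weights them, loosely mirroring the cross-scale attention. Everything here (group count, moving-average filters, the attention score) is a simplified assumption for illustration only.

```python
import numpy as np

def avg_filter_1d(x, k):
    """Moving-average filter of width k along the last axis ('same' length)."""
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)), mode="edge")
    return np.stack([xp[:, i:i + x.shape[1]] for i in range(k)]).mean(axis=0)

def multi_scale_block(feat, kernel_sizes=(1, 3, 5, 7)):
    """Split channels into groups; filter each group at its own scale."""
    groups = np.array_split(feat, len(kernel_sizes), axis=0)
    outs = [avg_filter_1d(g, k) for g, k in zip(groups, kernel_sizes)]
    # Cross-scale attention (toy version): softmax over each scale's
    # global average response, then re-weight the groups.
    scores = np.array([o.mean() for o in outs])
    weights = np.exp(scores) / np.exp(scores).sum()
    return np.concatenate([w * o for w, o in zip(weights, outs)], axis=0)

feat = np.random.rand(8, 32)   # 8 channels x 32 spatial positions
out = multi_scale_block(feat)
print(out.shape)  # (8, 32)
```

Because each group convolves only its own slice of the channels, the parameter and FLOP count drops relative to filtering all channels at every scale, which is the source of the efficiency claim.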