Video-based action recognition is a popular research topic in computer vision. It is significant not only in academia but also in industry, with great practical and economic value in intelligent surveillance, smart healthcare, autonomous driving, video understanding, human-computer interaction, and other applications. Off-the-shelf video-based action recognition techniques mainly focus on extracting spatio-temporal features from video efficiently, for example by utilizing the two-stream framework to learn spatial and temporal features separately, increasing network depth to improve modeling capacity, studying different fusion strategies to explore the interaction of the two streams, adopting LSTMs or sparse sampling strategies to capture long-range dependencies, or introducing three-dimensional convolution kernels to address temporal feature extraction. However, these methods still suffer from crucial problems such as insufficient inter-modality relationship modeling, weak feature representation ability, and high computational cost. How to extract spatial and temporal features efficiently remains a challenging and highly significant problem. The performance of a recognition system largely depends on whether it can extract and exploit effective information, and joint spatio-temporal feature learning is the key to video-based action recognition. This paper takes video-based action recognition as its research object and proposes spatio-temporal feature learning to analyze the complementarity between the spatial and temporal dimensions and the key action characteristics in video. Combining deep models for action recognition with the attention mechanism, three action recognition algorithms are proposed and implemented. The main work and achievements of this paper are as follows:

(1) This paper proposes multi-layer adaptive spatio-temporal feature fusion for action recognition. To learn the complementary relationship between the input modalities and effectively integrate spatial and temporal features from the two-stream architecture, an adaptive feature fusion (AFF) module is proposed. The AFF module adopts a dynamic gating fusion strategy conditioned on the input features: adaptive feature weights are learned by compressing the channel dimension, and different types of features are flexibly fused according to their own importance. The fusion is not limited to a single network layer but is performed on multiple layers: multi-level feature maps from the spatial and temporal streams are integrated so that they interact under different receptive fields. The feature fusion stream is taken as a third tributary to capture the difference between appearance and motion and transform it into effective spatio-temporal information that supplements the original spatial and temporal networks. The multi-layer adaptive spatio-temporal feature architecture greatly enhances the feature representation capability of the network. Extensive experiments on the UCF101 and HMDB51 benchmark datasets demonstrate that the proposed method achieves competitive results. In addition, the proposed AFF module is generic and effective and can be exploited in a plug-and-play manner as a feature fusion method.

(2) This paper proposes a cross-dimension multi-size attention enhancement mechanism for action recognition. Building on the multi-layer feature fusion architecture, the action recognition method is studied from the perspective of "3D interaction". A cross-dimension multi-size attention (CDMA) module is added in series; it consists of a multi-view attention (MVA) module and a channel attention (CA) module in parallel. The MVA module employs a three-branch architecture to explore the three-dimensional interaction of spatio-temporal information and uses dimensional interaction to fully mine spatio-temporal features, which enhances feature representation. By embedding multi-scale attention (MSA) units, receptive fields of different sizes are generated for the fused features, and multi-scale information is collected to enrich the extracted features. The CA module compensates for the lack of attention to channel information in the three-branch architecture and complements the correlation between the spatial and channel dimensions. Experiments on the UCF101 and HMDB51 datasets show that the combination of the two modules (MVA & CA) effectively improves the recognition accuracy of the deep model. In addition, the proposed CDMA module provides a concise way to model three-dimensional structure for spatio-temporal learning in two-dimensional convolutional neural networks. CDMA can also be embedded into mainstream network architectures as a general feature enhancement method to improve the feature extraction ability of the network.

(3) This paper proposes a dynamic selection and acceleration strategy for feature fusion weights in action recognition. Within the multi-layer feature fusion architecture, this method balances accuracy and computation from the perspective of network pruning. By analyzing the channel configuration, parameters, and computational cost of the multi-layer feature fusion architecture, the dynamic selection and acceleration (DSA) strategy is proposed to exploit the redundancy of the feature fusion weight parameters. The DSA strategy dynamically selects the weights generated by the adaptive feature fusion module in a spatio-temporally coordinated way. The method is designed to prune channels: the unselected weights and the feature maps of the corresponding channels are deleted entirely to accelerate the subsequent convolution operations. In addition, this paper designs a feature fusion weight salience binary function and a level-wise adaptive normalization function. The former generates a binary mask sequence to filter features; the latter flexibly uses the number of channels in each network layer to adjust the feature fusion weights of the corresponding layer. This method effectively reduces the model size and loses almost no precision while greatly reducing computation. Experimental results on the UCF101 and HMDB51 datasets show the effectiveness of the proposed method.
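The gated fusion idea behind the AFF module in contribution (1) can be sketched in a few lines. This is a minimal NumPy illustration, not the thesis's exact formulation: global average pooling as the "channel compression", the projection `w_gate`, and the softmax competition between the two streams are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_feature_fusion(spatial, temporal, w_gate):
    """Gated fusion of two feature maps of shape (C, H, W).

    Channel descriptors are obtained by pooling away the spatial
    dimensions (one assumed form of "compressing the channel
    dimension"), projected by a small gating matrix, and turned into
    per-channel fusion weights that make the two streams compete.
    """
    # (2, C): one pooled descriptor per input stream
    desc = np.stack([spatial.mean(axis=(1, 2)),
                     temporal.mean(axis=(1, 2))])
    # w_gate: (C, C) hypothetical learned projection
    weights = softmax(desc @ w_gate, axis=0)        # streams compete per channel
    fused = (weights[0][:, None, None] * spatial
             + weights[1][:, None, None] * temporal)
    return fused, weights
```

Because the weights are computed from the inputs themselves, a channel dominated by motion information can lean toward the temporal stream while another channel leans toward the spatial stream, which is the dynamic, input-conditioned behavior the abstract describes.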
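The three-branch "3D interaction" of the MVA module in contribution (2), with a parallel channel-attention term, can be approximated as follows. This is a sketch under assumptions: each branch is modeled as a permutation of the (C, H, W) tensor followed by a pooled sigmoid gate, the three views chosen here and the averaging of branches are illustrative, and the multi-scale (MSA) units are omitted for brevity.

```python
import numpy as np

def _branch_attention(x, axes):
    """One interaction view: permute x (C, H, W), gate by a sigmoid of
    the descriptor pooled over the leading dimension, permute back."""
    xp = np.transpose(x, axes)
    gate = 1.0 / (1.0 + np.exp(-xp.mean(axis=0, keepdims=True)))
    inverse = np.argsort(axes)                      # undo the permutation
    return np.transpose(xp * gate, inverse)

def cdma(x):
    """Cross-dimension attention sketch: three interaction views (MVA)
    averaged, plus a simple channel-attention (CA) term in parallel."""
    views = [(0, 1, 2), (1, 0, 2), (2, 1, 0)]       # C-H-W, H-C-W, W-H-C
    mva = sum(_branch_attention(x, v) for v in views) / 3.0
    # CA: per-channel sigmoid gate on the pooled channel descriptor
    c_gate = 1.0 / (1.0 + np.exp(-x.mean(axis=(1, 2))))
    ca = x * c_gate[:, None, None]
    return 0.5 * (mva + ca)
```

Rotating which dimension is pooled is what lets a 2D network capture pairwise interactions among channel, height, and width, while the parallel CA branch supplies the channel correlation the three-branch part alone lacks.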
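The DSA strategy in contribution (3) can be sketched as a top-k binary mask over the fusion weights plus channel deletion. The keep ratio, the sum-to-channel-count form of the level-wise normalization, and the function names below are assumptions for illustration; the thesis's salience and normalization functions are not specified in the abstract.

```python
import numpy as np

def salience_mask(weights, keep_ratio=0.5):
    """Binary salience function: keep the top-k fusion weights.

    Ties at the threshold may keep a few extra channels; a learned or
    scheduled threshold could replace this fixed ratio.
    """
    k = max(1, int(len(weights) * keep_ratio))
    thresh = np.sort(weights)[-k]
    return weights >= thresh

def levelwise_normalize(weights):
    """One plausible level-wise adaptive normalization: rescale so the
    weights of a layer sum to its channel count, making salience
    thresholds comparable across layers of different widths."""
    return weights * len(weights) / weights.sum()

def dsa_prune(feat, weights, keep_ratio=0.5):
    """Delete unselected weights AND their feature maps (feat: (C, H, W)),
    so subsequent convolutions run on fewer channels."""
    mask = salience_mask(weights, keep_ratio)
    return feat[mask] * weights[mask][:, None, None]
```

The point of deleting the masked channels outright, rather than zeroing them, is that the following convolution genuinely processes a smaller tensor, which is where the acceleration comes from.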