
Research On Action Recognition Algorithm Based On Spatio-Temporal Feature Representation

Posted on: 2023-03-05
Degree: Master
Type: Thesis
Country: China
Candidate: Z W Yi
Full Text: PDF
GTID: 2558307100475444
Subject: Electronic Science and Technology
Abstract/Summary:
With the rapid growth of online video and the fast development of machine vision techniques, action recognition in video has become an important technology in video surveillance, human-computer interaction, and video content retrieval. Video contains rich spatio-temporal information, and the performance of video action recognition depends chiefly on the algorithm's ability to model spatial and temporal features. However, the relationship between human actions and the corresponding video data is complex and variable: video objects are often strongly correlated visually, yet the action categories they belong to differ clearly in semantics. As a result, current deep learning models face many challenges in learning spatio-temporal features, such as information redundancy, large parameter counts, and a weak ability to describe motion changes. In response to these problems, this thesis studies action recognition from the perspective of spatio-temporal feature representation, aiming to learn powerful spatio-temporal features, fuse them efficiently, and promote the development and deployment of action recognition models in practical applications. The main work of this thesis is as follows:

(1) To address the uneven spatio-temporal distribution and redundancy of video information, a plug-and-play, lightweight spatio-temporal attention module, ST-AM, is proposed. Based on the observation that the key information within and between video frames directly determines the category an action belongs to, a spatial attention module and a temporal attention module are used to emphasize or suppress intra-frame spatial information and inter-frame temporal information, helping the model learn the key frames and the intra-frame spatial information related to the action category. Experimental results show that combining the proposed module with a three-dimensional convolutional neural network improves spatio-temporal feature representation while adding a negligible number of model parameters.

(2) To address the large parameter count and high computational complexity of three-dimensional neural network models, an action recognition algorithm based on spatio-temporal feature fusion is proposed. By analyzing the three-dimensional orthogonality of video, the algorithm learns how specific the information contained in each of the three orthogonal views is. First, the idea of convolution kernel decomposition is introduced: three sets of two-dimensional convolutions replace three-dimensional convolutions and encode temporal and spatial features separately. Second, a self-attention mechanism determines the degree to which temporal and spatial features are fused, and a fusion scheme is designed to combine them. Experimental results show that the proposed method improves model accuracy while reducing the number of model parameters, which demonstrates the effectiveness of spatio-temporal feature fusion.

(3) To address the large uncertainty in the speed of motion changes, a SlowFast network combined with a spatio-temporal attention mechanism is proposed. First, the algorithm uses the fast branch and the slow branch of the SlowFast network to extract fast-changing motion features and slow-changing appearance features, respectively, and uses the lateral connections between the two branches to fuse spatio-temporal features continuously. Second, the spatio-temporal attention mechanism enhances the network's ability to describe the speed of motion changes in detail. Experimental results show that the proposed model achieves 93.7% and 68.5% accuracy on the UCF-101 and HMDB-51 datasets, respectively, which demonstrates the effectiveness of the algorithm.
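
The idea of emphasizing or suppressing frames and intra-frame locations, as described for the spatio-temporal attention module, can be illustrated with a minimal NumPy sketch. This is not the thesis's ST-AM: the abstract does not give its layers, so the function names, the mean-pooling descriptors, and the parameter-free sigmoid scoring below are all assumptions chosen for illustration. A real module would presumably replace the fixed scoring with a small number of learned weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def temporal_attention(clip):
    """Per-frame weights of shape (T,) for a clip of shape (C, T, H, W).

    Assumption: each frame is summarized by its mean activation, and the
    score is a sigmoid of the centred summary (no learned parameters).
    """
    frame_descriptor = clip.mean(axis=(0, 2, 3))               # (T,)
    return sigmoid(frame_descriptor - frame_descriptor.mean())

def spatial_attention(clip):
    """Per-location weights of shape (T, H, W).

    Assumption: channels are averaged, then each frame's map is centred
    and squashed to (0, 1).
    """
    channel_avg = clip.mean(axis=0)                            # (T, H, W)
    centred = channel_avg - channel_avg.mean(axis=(1, 2), keepdims=True)
    return sigmoid(centred)

def st_attention(clip):
    """Reweight a (C, T, H, W) feature map by frame, then by location."""
    w_t = temporal_attention(clip)                             # (T,)
    w_s = spatial_attention(clip)                              # (T, H, W)
    out = clip * w_t[None, :, None, None]                      # scale whole frames
    return out * w_s[None, :, :, :]                            # scale locations

# Toy feature map: C=4 channels, T=8 frames, 6x6 spatial grid.
rng = np.random.default_rng(0)
clip = rng.standard_normal((4, 8, 6, 6))
clip[:, 3] += 2.0            # make frame 3 stand out as a "key frame"
out = st_attention(clip)
```

Because the output has the same shape as the input, a block like this can be dropped between existing layers, which is what makes an attention module of this kind plug-and-play.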
Keywords/Search Tags: Action recognition, Convolutional neural network, Attention mechanism, Spatio-temporal feature fusion