
Research On Video Action Recognition Based On Spatial-temporal Feature Fusion

Posted on: 2023-04-09
Degree: Master
Type: Thesis
Country: China
Candidate: H Chen
Full Text: PDF
GTID: 2568306788455404
Subject: Computer technology
Abstract/Summary:
In recent years, with the rapid development of computer vision and artificial intelligence, human action recognition by computer has become one of the hot topics in the field. Its main task is to extract effective spatial and temporal features from an input video and then classify them on the principle that samples of the same class should be similar while samples of different classes should be separable. Action recognition has been widely studied in applications such as intelligent surveillance, human-computer interaction, and motion detection. This paper surveys the current mainstream action recognition algorithms based on spatial-temporal feature fusion and proposes three spatial-temporal feature fusion methods for extracting video features with strong expressive ability and robustness. The main work of this paper is as follows:

(1) This paper first groups the mainstream deep-learning-based algorithms into three categories: two-dimensional convolutional neural network algorithms, three-dimensional convolutional neural network algorithms, and self-attention-based architectures. It then analyzes the structural advantages and disadvantages of the different methods and compares their reported recognition results on the UCF101, HMDB51, and Something-Something datasets.

(2) To compensate for the inability of two-dimensional convolutional networks to relate feature context and learn global information, this paper applies a self-attention mechanism to the high-level feature maps output by the convolutional network, along the temporal and spatial dimensions respectively. Multiscale features are extracted by replacing the linear transformation matrices of self-attention with spatial one-dimensional convolutions in different directions and temporal one-dimensional convolutions with different dilation rates, which enriches the model's ability to express features. Ablation experiments show that the spatial-temporal convolutional attention designed in this paper effectively improves the recognition accuracy of the model.

(3) To address the high computational complexity of three-dimensional convolution, the output features of a two-dimensional spatial convolution and a one-dimensional temporal convolution are fused by squeeze-and-excitation along the time dimension, forming a spatial-temporal feature interaction module. Experimental results show that this fusion method significantly improves recognition performance compared with both 3D and P3D convolutions.

(4) To achieve better spatial-temporal feature interaction during propagation, a multilevel feature aggregation module is designed based on channel splitting and peer residual connections. The module divides the network along the channel dimension, links the input of the three-dimensional convolution to the output of the two-dimensional convolution through residual connections to extract video-level features with a larger receptive field, and then aggregates the outputs of the two-dimensional and three-dimensional convolutions along the channel dimension to enrich feature diversity. In addition, a dynamic information enhancement module is designed, which strengthens the feature weights of dynamic regions along the time dimension and suppresses interference from irrelevant information. Ablation experiments show that both modules improve classification accuracy on action recognition tasks.
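The multiscale temporal branch described in (2) rests on dilated one-dimensional convolution: the same kernel applied at different dilation rates covers different temporal extents. The following is a minimal pure-Python sketch of that idea, not the thesis's actual implementation; the function name and the difference kernel are illustrative.

```python
def dilated_conv1d(x, kernel, dilation):
    """1-D temporal convolution with a given dilation rate (valid padding):
    each output position mixes inputs spaced `dilation` steps apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # temporal extent the kernel covers
    return [sum(kernel[j] * x[t + j * dilation] for j in range(k))
            for t in range(len(x) - span + 1)]

# Multiscale temporal features: run the same signal through several
# dilation rates and concatenate, so short and long motions both register.
signal = [1, 2, 4, 8, 16, 32]
multiscale = dilated_conv1d(signal, [1, -1], 1) + dilated_conv1d(signal, [1, -1], 2)
```

A larger dilation widens the receptive field without adding parameters, which is why mixing several rates yields multiscale features at low cost.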
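The fusion step in (3) can be pictured as a squeeze-and-excitation gate over the summed branch outputs. The sketch below is a deliberately simplified stand-in for the thesis's module, assuming one feature value per channel and a given excitation weight vector (real SE blocks learn a small two-layer bottleneck instead).

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def se_fuse(spatial, temporal, excite_w):
    """Fuse a 2-D spatial branch and a 1-D temporal branch (one feature
    value per channel) with a squeeze-and-excitation style gate."""
    fused = [s + t for s, t in zip(spatial, temporal)]   # add branch outputs
    squeeze = sum(fused) / len(fused)                    # global average pool
    # Excitation: per-channel sigmoid gates driven by the squeezed statistic.
    gates = [sigmoid(w * squeeze) for w in excite_w]
    return [g * f for g, f in zip(gates, fused)]
```

The appeal of this factorized 2D+1D design over full 3D convolution is that the expensive spatio-temporal kernel is replaced by two cheap ones plus a lightweight gate.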
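The dynamic information enhancement module in (4) weights features by how much they change over time. A hedged pure-Python sketch of one way to realize that, assuming each frame is a flat feature vector; the residual `1 + w` form is an assumption chosen so static frames are damped rather than erased, not a detail taken from the thesis.

```python
def motion_enhance(frames):
    """Reweight per-frame feature vectors by their frame-to-frame change,
    strengthening dynamic regions and damping static background."""
    mags = [0.0]                                   # first frame: no motion cue
    for prev, cur in zip(frames, frames[1:]):
        mags.append(sum(abs(c - p) for c, p in zip(cur, prev)) / len(cur))
    peak = max(mags) or 1.0                        # normalize weights to [0, 1]
    weights = [m / peak for m in mags]
    # Residual-style enhancement: keep the original feature and add the
    # motion-weighted part on top, so unchanged frames pass through intact.
    return [[v * (1.0 + w) for v in f] for w, f in zip(weights, frames)]
```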
Keywords/Search Tags:Action recognition, Convolution network, Self-attention mechanism, Deep learning