Video Action Recognition Based On 2D Convolution Network Under Spatio-Temporal Feature Enhancement Mechanism

Posted on: 2022-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: S M Gong
Full Text: PDF
GTID: 2518306527484374
Subject: Control Science and Engineering
Abstract/Summary:
Video action recognition draws on computer vision, pattern recognition, image processing, artificial intelligence, and related technologies to automatically analyze image sequences recorded by a camera, without human intervention. It locates, tracks, and recognizes the human body in dynamic scenes and, on that basis, analyzes and judges human actions. The goal is to obtain a semantic description and understanding of an action by analyzing action feature data. The technology can be applied in fields such as autonomous driving, human-computer interaction, smart security monitoring, and smart home monitoring, so research on video action recognition is of broad significance. This thesis addresses the inability of existing 2D convolutional neural networks to extract spatio-temporal feature information between input frames. The main research results are as follows.

(1) Building a spatio-temporal interaction channel attention module to improve video action recognition. Based on an analysis of the shortcomings of existing channel attention mechanisms, a simple and effective spatio-temporal interaction channel attention module is proposed within a deep learning framework. The module is embedded into the existing backbone network ResNet50 to build a more effective action recognition network. The module first transforms the dimensions of the input features with a dimension reconstruction operation. Size reconstruction is then used to compress the individual features of each frame into new features, and a convolution operation extracts spatio-temporal feature information. Next, the feature is normalized and matrix-multiplied with the previous feature to achieve information compression. Finally, feature recalibration along the channel dimension is achieved by element-wise multiplication. The action recognition network built with the proposed module reaches a recognition accuracy of 95.51% on the UCF101 dataset and 74.71% on the HMDB51 dataset.

(2) Establishing a spatio-temporal double-branch parallel attention module to improve the accuracy of video action recognition. Existing attention modules either cannot extract spatio-temporal feature information or are computationally expensive. To address this, an efficient spatio-temporal double-branch parallel attention module is proposed within a deep learning framework; it can be embedded directly into mainstream backbone networks to strengthen their feature extraction ability. The module consists of a channel-temporal branch and a space-temporal branch in parallel. In the channel-temporal branch, the spatial information sequence of each channel is extracted by multi-scale pooling, and the attention weight of each channel is obtained by a convolution operation. The weights along the temporal dimension are then obtained by matrix operations and Softmax, and feature recalibration is achieved by element-wise multiplication. In the space-temporal branch, the spatial information of all channels is compressed into two feature maps by maximum pooling and average pooling; a feature map containing temporal information is then obtained by matrix operations and Softmax and mapped back onto the original feature map. The action recognition network based on the proposed module achieves a recognition accuracy of 96.14% on the UCF101 dataset and 75.32% on the HMDB51 dataset.

(3) Improving video action recognition by designing a time and channel shuffling module. To address the inability of existing 2D convolutional neural networks to extract spatio-temporal feature information between input frames, a time and channel shuffling module is proposed within a deep learning framework and embedded into the existing backbone network ResNet50 to build a more effective action recognition network. First, the preprocessed multi-frame images are fed into the backbone network, which extracts the individual information of each frame; this is recorded as the original information. The time and channel shuffling module then converts the independent input feature maps into a new feature map with spatio-temporal correlation by matrix operations and extracts fusion information, which is recorded as the spatio-temporal information. The original information and the spatio-temporal information are then added and passed to the deeper layers of the network to complete the action recognition task. The network achieves a recognition accuracy of 96.16% on the UCF101 dataset and 75.41% on the HMDB51 dataset.

(4) Constructing a spatio-temporal feature pyramid module to improve video action recognition. An action recognition method based on a spatio-temporal feature pyramid module is proposed within a deep learning framework to enable 2D networks to extract time-series information between input frames. For multi-frame image input, the backbone network first extracts the individual information of each frame and records it as the original information. The spatio-temporal feature pyramid module then uses matrix operations and a dilated convolution pyramid to extract temporal information with spatio-temporal correlation from the input feature map. The original information and the temporal information are then combined by weighted addition and passed to the deeper layers of the network. Finally, a fully connected layer classifies the action in the video. The network achieves a recognition accuracy of 96.43% on the UCF101 dataset and 75.55% on the HMDB51 dataset.
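The data flow described in contribution (3) can be illustrated with a small NumPy sketch: shuffle the per-frame feature maps so that frames and channels are mixed, extract fused information, and add it back onto the original per-frame features. This is only an illustration under stated assumptions, not the thesis's implementation. The abstract says only "matrix operation", so the specific permutation (a transpose followed by a reshape), the (T, C, H, W) tensor layout, the function names, and the `fuse` callable standing in for the learned layers are all hypothetical.

```python
import numpy as np


def shuffle_time_and_channel(x):
    """Permute the joint (frame, channel) axis of one clip's feature maps.

    x has shape (T, C, H, W): T frames with C channels each.
    ASSUMPTION: a transpose followed by a reshape is one possible
    "matrix operation"; the abstract does not specify the layout.
    """
    t, c, h, w = x.shape
    # (T, C, H, W) -> (C, T, H, W), then regroup into T blocks of C maps,
    # so each output block holds maps taken from several input frames.
    return x.transpose(1, 0, 2, 3).reshape(t, c, h, w)


def unshuffle_time_and_channel(y):
    """Inverse permutation of shuffle_time_and_channel."""
    t, c, h, w = y.shape
    return y.reshape(c, t, h, w).transpose(1, 0, 2, 3)


def enhance(x, fuse):
    """Add fused cross-frame information to the original per-frame features.

    `fuse` is a hypothetical stand-in for the learned fusion layers;
    it must return an array with the same shape as its input.
    """
    mixed = shuffle_time_and_channel(x)
    fused = fuse(mixed)
    if fused.shape != x.shape:
        raise ValueError("fuse must preserve the feature shape")
    # Residual-style addition: original information + spatio-temporal information.
    return x + unshuffle_time_and_channel(fused)


# Toy clip: 4 frames, 6 channels, 2x2 feature maps.
clip = np.arange(4 * 6 * 2 * 2, dtype=np.float64).reshape(4, 6, 2, 2)
# Placeholder fusion: centre each shuffled block over its channel axis.
out = enhance(clip, fuse=lambda f: f - f.mean(axis=1, keepdims=True))
```

Because the shuffle is a pure permutation, it adds no parameters; any learned capacity would sit in whatever replaces `fuse`. The addition at the end mirrors the abstract's statement that the original and spatio-temporal information are summed before being passed to the deeper layers.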
Keywords/Search Tags: video action recognition, deep learning, spatio-temporal attention, shuffling mechanism, spatio-temporal feature pyramid