
Research On Human Action Recognition Fusing 2D CNN And Vision Transformer

Posted on: 2024-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: X H Zhu
Full Text: PDF
GTID: 2568307142966209
Subject: Computer Science and Technology
Abstract/Summary:
With the increasingly widespread use of video platforms such as Douyin, Kuaishou, and Watermelon Video, large numbers of videos of varying quality are uploaded to the network every day, and video traffic is growing explosively. How to process and analyze video effectively has therefore become a research hot spot, and video understanding has emerged in response. As one of the important research directions of video understanding, human action recognition has also received extensive attention, with broad application value in intelligent surveillance, autonomous driving, medical care, and other fields. Classic 2D CNN methods for human action recognition can both suppress interference from background noise and improve recognition accuracy. However, these methods have two problems to be solved: first, they lack temporal modeling capability; second, the temporal information they capture is limited, so it cannot provide more effective cues, and their ability to recognize actions with small inter-frame differences needs improvement. To address these problems, this thesis proposes a human action recognition network that fuses a 2D CNN with a Vision Transformer (ViT). The main work of this thesis is as follows:

(1) To address the 2D CNN's lack of temporal modeling capability, this thesis extends the Vision Transformer, which directly models long-distance interactions, to video-based human action recognition and integrates it with a 2D CNN, yielding the Conv-Transformer1 network. On the basis of channel attention, global pooling is replaced with a combination of max pooling and average pooling, and the improved channel attention is cascaded with spatial attention to build an improved 2D CNN that extracts enhanced intra-frame spatial features, as sketched below. The two-dimensional feature maps are then embedded and flattened into one-dimensional vectors that serve as input tokens for a Transformer encoder, which captures inter-frame temporal features to obtain temporal context; finally, an MLP head classifies the action.
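The thesis gives no code for this front end; the following is a minimal PyTorch sketch of a channel-spatial attention of the kind described in (1), with a channel branch that combines max pooling and average pooling. All module names, the reduction ratio, and the tokenization via pooling and projection are illustrative assumptions, not the thesis's actual implementation.

```python
# Sketch of the channel-spatial attention front end from (1).
# Module names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention whose descriptor combines max and average pooling."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):  # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))             # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))              # max-pooled descriptor
        weights = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * weights                             # reweight each channel


class SpatialAttention(nn.Module):
    """Spatial attention cascaded after the channel attention."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):  # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)              # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)               # (B, 1, H, W)
        weights = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weights                             # reweight each location


class FrameTokenizer(nn.Module):
    """Applies both attentions to per-frame 2D CNN features and emits
    one token per frame for the downstream Transformer encoder."""

    def __init__(self, channels: int, dim: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()
        self.proj = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, dim)
        )

    def forward(self, x):  # x: (B*T, C, H, W) frame features from the 2D CNN
        return self.proj(self.sa(self.ca(x)))          # (B*T, dim) tokens
```

In this sketch each frame yields a single token; the thesis's embedding of the feature maps may differ, but the cascade order (channel attention, then spatial attention, then tokenization) follows the description above.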
(2) To address the problem that single temporal information cannot provide more effective cues, this thesis designs the spatiotemporal Conv-Transformer2 network, which adds a temporal encoder. The temporal encoder module consists of four improved Transformer encoders in series; each augments the standard multi-head self-attention with a local relationship block, fusing the two into a global-local relationship block (a sketch of this block is given after the abstract). The global relationship block uses the standard multi-head self-attention layer to model long-range action dependencies and obtain global information, while the local relationship block uses a 1D temporal convolution layer to enhance the representation of the token blocks and obtain rich local information. The spatiotemporal Conv-Transformer2 network first extracts spatial features with the improved 2D CNN, then feeds them into the temporal encoder module to obtain global-local information, then fuses the outputs of the four improved Transformer encoders, and finally performs classification in the classification module.

(3) The overall framework for human action recognition fusing a 2D CNN and a Vision Transformer is composed of the improved channel attention mechanism, the spatial attention mechanism, and the Vision Transformer architecture, and uses subsampled data to improve the accuracy of human action recognition. The Conv-Transformer1 network model was evaluated on HMDB-51 and UCF-101, two public datasets dominated by scene information, and achieved recognition accuracies of 69.4% and 95.5%, respectively. The Conv-Transformer2 network model achieved recognition accuracies of 73.4% and 96.9% on HMDB-51 and UCF-101, respectively. To verify the effectiveness of the improved Transformer encoder, experiments were also carried out on Diving-48, a dataset dominated by temporal information, where the model achieved a recognition accuracy of 76.0%. In addition, the influence of each module of the 2D CNN and Vision Transformer architecture on network performance is analyzed, and ablation experiments are designed for the two schemes above to verify the impact of removing each component. The experimental results show that both network models, Conv-Transformer1 and Conv-Transformer2, improve the performance of human action recognition.
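As noted in (2), here is a minimal PyTorch sketch of a global-local relationship block: the global branch is standard multi-head self-attention over the frame tokens, and the local branch is a 1D temporal convolution. The residual wiring, the additive fusion of the two branches, and the summation over the four encoder outputs are illustrative assumptions; the thesis specifies only the two branches and the four-encoder series.

```python
# Sketch of the global-local relationship block and temporal encoder
# from (2). Fusion and residual details are assumptions.
import torch
import torch.nn as nn


class GlobalLocalBlock(nn.Module):
    """One improved Transformer encoder layer: multi-head self-attention
    (global branch) fused with a 1D temporal convolution (local branch)."""

    def __init__(self, dim: int, heads: int = 8, kernel_size: int = 3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
            nn.GELU(), nn.Linear(4 * dim, dim),
        )

    def forward(self, x):  # x: (B, T, dim) sequence of frame tokens
        h = self.norm(x)
        glob, _ = self.attn(h, h, h)                   # long-range dependencies
        loc = self.local(h.transpose(1, 2)).transpose(1, 2)  # local temporal cues
        x = x + glob + loc                             # fuse the two branches
        return x + self.mlp(x)                         # standard Transformer MLP


class TemporalEncoder(nn.Module):
    """Four global-local blocks in series; their outputs are fused
    (here by summation) before the classification module."""

    def __init__(self, dim: int = 512, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(GlobalLocalBlock(dim) for _ in range(depth))

    def forward(self, x):  # x: (B, T, dim)
        outs = []
        for block in self.blocks:
            x = block(x)
            outs.append(x)
        return torch.stack(outs).sum(dim=0)            # fused multi-stage features
```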
Keywords/Search Tags:Human action recognition, 2D CNN, Channel-Spatial attention, Vision Transformer, Global-local relationship block