Research On Spatiotemporal Information Fusion And Attention Enhancement Based Human Action Recognition

Posted on: 2022-09-20
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Deng
Full Text: PDF
GTID: 2518306527978749
Subject: Electronics and Communications Engineering
Abstract/Summary:
Human action recognition is one of the most important research topics in artificial intelligence, pattern recognition and machine learning, and a popular topic in computer vision and multimedia analysis. It has significant academic value and broad application value in security monitoring, human-computer interaction, medical diagnosis, video classification and other fields. Researchers made considerable progress in the early stages of this work. In practical applications, however, the data used for human action recognition is often disturbed by illumination, complex backgrounds, occlusion, the human body itself and other factors, so human action recognition remains a very challenging topic. Existing methods mainly aim to improve the internal structure of single-stream networks and ignore the information interaction, fusion and enhancement between multi-stream networks. To tackle these problems, this thesis studies the topic from two aspects: multi-level spatiotemporal information fusion and multi-branch attention-based information enhancement. The main contributions and achievements of this thesis are summarized as follows:

(1) A human action recognition method based on multi-level spatiotemporal information fusion is proposed. To exploit multi-level spatiotemporal features effectively, a multi-level spatiotemporal information compact fusion module is proposed. The module reduces the dimension of the spatiotemporal features and enables information interaction and fusion between the spatial features and the temporal features. Moreover, it solves the problem that the compact bilinear pooling algorithm cannot directly fuse the spatiotemporal features of multiple convolution layers. A three-stream prediction-score fusion network is introduced whose branch networks are kept separate, in order to reduce the influence of the fusion operations on the feature extraction networks. The network also uses the temporal segment network for long-range temporal structure modeling. Experiments on two RGB video-based human action recognition datasets, UCF101 and HMDB51, show that the proposed method achieves excellent recognition performance.

(2) A human action recognition method based on feature enhancement with multi-receptive-field spatial-channel attention is proposed. Building on the method of the previous chapter, a multi-receptive-field spatial-channel attention module is introduced to adjust each part of the fused feature and make the network focus on the informative regions of the input data. The module combines a spatial branch and a channel branch in parallel to generate the attention weights that adjust the features. The spatial branch uses convolutions with different kernel sizes to expand its receptive field. In addition, the residual connection of the module makes it plug-and-play within a network. Experiments on UCF101 and HMDB51 indicate that the proposed method achieves satisfactory recognition accuracy.

(3) A human action recognition method for skeleton data based on multi-perspective feature fusion enhancement is proposed. A multi-perspective feature fusion enhancement module is introduced to strengthen and fuse skeleton data. The module combines a spatial branch, a channel branch and a temporal branch in parallel. When its inputs are the same, the module acts as an attention module that enhances the input data and extracts more effective features. When its inputs differ, the module acts as a spatiotemporal information fusion module: it captures the effective information provided by one input to strengthen the information of the other, thereby accomplishing the fusion. The module is used both to strengthen the graph-convolution feature extraction network and to fuse the multi-layer spatiotemporal features produced by graph convolution. In addition, a skeleton-diff data extraction method is proposed to make full use of the temporal information in skeleton data. Combining the first-order and second-order information of the skeleton data, a three-stream fusion network based on skeleton data is proposed. Experiments on the skeleton-based human action recognition datasets Kinetics-Skeleton, NTU-RGBD60 and NTU-RGBD120 show that the proposed method is effective.
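The abstract does not define how skeleton-diff data, second-order information or the three-stream score fusion are computed. The sketch below is therefore only an illustration under stated assumptions: skeleton-diff data is taken to be the frame-to-frame difference of joint coordinates, second-order information is taken to be bone vectors (each joint minus its parent joint, as is common in skeleton-based recognition), and the three streams are fused by a weighted average of their class scores. The function names, the `parents` array and the equal default weights are hypothetical and do not come from the thesis.

```python
import numpy as np


def temporal_diff(joints):
    """Assumed 'skeleton-diff' data: difference between consecutive frames.

    joints: array of shape (T, V, C) = (frames, joints, coordinates).
    Returns an array of the same shape; the last frame is zero-padded.
    """
    diff = np.zeros_like(joints)
    diff[:-1] = joints[1:] - joints[:-1]
    return diff


def bone_vectors(joints, parents):
    """Assumed second-order information: each joint minus its parent joint.

    parents: hypothetical list where parents[v] is the parent index of joint v
    (a root joint lists itself, giving a zero bone vector).
    """
    return joints - joints[:, parents, :]


def fuse_scores(stream_scores, weights=None):
    """Weighted average of per-stream class scores (late fusion).

    stream_scores: list of 1-D arrays, one per stream, each of length num_classes.
    weights: optional per-stream weights; equal weights are assumed by default.
    """
    scores = np.stack(stream_scores)              # (num_streams, num_classes)
    if weights is None:
        weights = np.full(len(stream_scores), 1.0 / len(stream_scores))
    return np.tensordot(np.asarray(weights), scores, axes=1)
```

In this sketch each stream (joints, bones, temporal differences) would be classified by its own network, and only the resulting score vectors are combined, which mirrors the abstract's point that keeping the branches separate limits the effect of fusion on feature extraction.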
Keywords/Search Tags: Human action recognition, Feature fusion, Attention enhancement, Multi-stream network, Multi-level feature