
Study On The Skeleton-based Human Action Recognition Model

Posted on: 2022-02-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F J Li
GTID: 1488306731999529
Subject: Information and Communication Engineering
Abstract/Summary:
The rapid development of communication technologies such as 5G is making video the main carrier for presenting and disseminating information. The volume of video now exceeds what humans can process, so the growing demand for visual perception computing must be met with the analysis and processing capabilities of computers. Human action recognition, an important branch of visual perception computing, has significant research value because its results apply to many fields, such as intelligent video surveillance, intelligent medical care, human-computer interaction, and autonomous driving. Traditional action recognition methods mainly take RGB video as input, but RGB video has a large data volume and low semantic density, and it exposes the model to interference from irrelevant information such as the background. A skeleton sequence, by contrast, records only the position coordinates of human joints: it has a small data volume and high semantic density, excludes irrelevant information such as the background, and yields robust model representations. Human action recognition models based on skeleton sequences have therefore received increasing academic attention. Among them, methods based on graph convolutional neural networks encode the skeleton sequence as a spatio-temporal graph that follows the physical structure of the human body, then extract and classify spatio-temporal features, achieving high recognition accuracy. However, existing methods still suffer from three problems: temporal modeling at too uniform a scale, insufficient spatial modeling capability, and poor coupling of spatio-temporal features. To address these problems systematically, this dissertation proposes four network models for skeleton-based human action recognition built on graph convolutional neural networks, as described below.

(1) A multi-stream and enhanced spatial-temporal graph convolution network named MS-ESTGCN is proposed. a) Multiple temporal convolutional kernels of different sizes extract multi-scale temporal features, and dense connections link the temporal graph convolutional sub-layers to reuse and aggregate temporal features. b) A two-branch spatial enhancement structure is designed, in which a spatial graph convolution branch is added to the basic block of MS-ESTGCN to enhance the extraction of spatial features. c) To make full use of low-level features, MS-ESTGCN takes four types of spatial information (joints, bones, and their relative positions) and two types of temporal information (speeds of joints and bones) as inputs, forming a six-stream framework that significantly improves network performance. With 37.8M parameters, MS-ESTGCN achieves leading action recognition accuracy; for example, it reaches 91.4% under the cross-subject evaluation protocol of the NTU-RGB+D 60 dataset.

(2) An enhanced spatial and extended temporal graph convolutional network named EEGCN is proposed. a) A one-shot aggregation method connects multiple temporal graph convolutional sub-layers to extract multi-scale temporal features while significantly reducing the number of connections between the sub-layers. b) A pseudo-two-stream spatial enhancement structure is designed, in which one pseudo-stream enhances static spatial features and the other enhances dynamic temporal features, further improving network performance. c) A channel attention module is introduced to reweight the channels of the spatio-temporal feature maps and so achieve better coupling of spatio-temporal features. EEGCN has 17.2M parameters, and its recognition accuracy is 91.6% under the cross-subject evaluation protocol of the NTU-RGB+D 60 dataset.

(3) A single-oriented pyramidal graph convolutional network named SPGCN is proposed. a) A single-oriented pyramidal graph convolutional structure extracts temporal features, capturing different levels of temporal information through a diverse pool of temporal convolutional kernel types. b) A pseudo-two-stream spatial enhancement structure based on a shared graph is designed, in which the spatial graph convolution layers in the basic block of SPGCN share the same adaptive graph, reducing the number of parameters while maintaining performance. c) Two loss functions, cross-entropy and pairwise Gaussian, are used together to maximize both the inter-class separability and the intra-class compactness of actions. SPGCN has 11.2M parameters, and its recognition accuracy is 91.1% under the cross-subject evaluation protocol of the NTU-RGB+D 60 dataset.

(4) A frequency-driven channel attention and full-scale temporal modeling network named FF-TMN is proposed. a) A full-scale temporal modeling method is proposed in which each temporal graph convolutional sub-layer achieves more comprehensive temporal modeling by employing all available convolutional kernel sizes in the range from 1 to 9. b) A frequency-driven channel attention module is proposed that embeds the spatial and temporal features of the feature map into global channel descriptors using different strategies, namely global average pooling and the discrete cosine transform, to achieve better coupling of spatio-temporal features. FF-TMN has 5.0M parameters, and its recognition accuracy is 91.2% under the cross-subject evaluation protocol of the NTU-RGB+D 60 dataset.
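
The multi-stream input described in (1) c) can be illustrated with a minimal sketch. It assumes the common conventions that a bone is the vector from a joint's parent to the joint and that speed is the difference between consecutive frames; the abstract does not spell out either definition, nor how the relative-position streams are computed, so those two streams are omitted here. The five-joint `PARENT` table and all function names are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

Vec = Tuple[float, float, float]
Frame = List[Vec]        # one 3-D coordinate per joint
Sequence = List[Frame]   # one frame per time step

# Hypothetical 5-joint toy skeleton (an assumption, not a dataset layout):
# PARENT[j] is the joint that joint j attaches to. Joint 0 is the root and
# is its own parent, so its bone vector is always zero.
PARENT = [0, 0, 1, 1, 1]


def sub(a: Vec, b: Vec) -> Vec:
    """Component-wise difference a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def bones(seq: Sequence) -> Sequence:
    """Bone stream: vector from each joint's parent to the joint."""
    return [[sub(frame[j], frame[PARENT[j]]) for j in range(len(frame))]
            for frame in seq]


def speed(seq: Sequence) -> Sequence:
    """Speed stream: next frame minus current frame, zeros for the last frame."""
    zero: Frame = [(0.0, 0.0, 0.0)] * len(seq[0])
    diffs = [[sub(nxt, cur) for cur, nxt in zip(f0, f1)]
             for f0, f1 in zip(seq, seq[1:])]
    return diffs + [zero]


def basic_streams(joints: Sequence) -> Dict[str, Sequence]:
    """Four of the six input streams; relative-position streams are omitted."""
    b = bones(joints)
    return {
        "joint": joints,
        "bone": b,
        "joint_speed": speed(joints),
        "bone_speed": speed(b),
    }
```

Each stream keeps the frames × joints × 3 layout of the input, so each one could be fed to its own network branch. How MS-ESTGCN combines the branches is not stated in the abstract.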
Keywords/Search Tags:machine learning, action recognition, multi-scale modeling, channel attention