
Research On Human Action Recognition From RGB-D Images

Posted on: 2020-05-02    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Z Y Zhai    Full Text: PDF
GTID: 1368330605981276    Subject: Electronic Science and Technology
Abstract/Summary:
Human action recognition has wide applications in intelligent surveillance, human-computer interaction, virtual reality, and video analysis, and has attracted broad attention from both academia and industry. Traditional action recognition from RGB images is easily affected by illumination changes, partial occlusion, shadows, and cluttered backgrounds. With the advent of low-cost, easy-to-operate RGB-and-Depth (RGB-D) sensors such as the Microsoft Kinect, more and more researchers have begun to work on action recognition from depth images. Compared with RGB images, depth images are robust to changes in illumination, shadow, and other environmental factors, but they lack the color and texture information that also plays an important role in recognition. Exploiting the complementary characteristics of the two modalities can therefore boost both the performance and the robustness of human action recognition. However, RGB-D data also bring new challenges, such as the latent semantic relationship between the two modalities and the significant appearance differences between RGB and depth images. Many scholars have studied RGB-D action recognition and achieved strong results, but several problems remain in existing methods: (1) Because they are hand-crafted, traditional low-level features cannot describe actions well under varying scenes, illumination, and human poses; moreover, owing to the appearance gap between RGB and depth images, low-level features designed for RGB images have limited ability to describe the texture, edges, and shapes in depth images. (2) Most multimodal correlation learning methods neglect the complex topological structure of multi-modal data, so the learned joint representation does not fully convey the semantic relationships in the original data. (3) Existing deep methods based on Siamese networks usually construct far more training pairs than original training samples in order to learn the semantic relationship between RGB and depth data, which makes semantic-consistency learning time-consuming.

To address these problems, this thesis treats the semantic consistency between RGB and depth data as latent information and proposes several correlation learning methods for RGB-D action recognition. The main contributions are as follows:

(1) For feature extraction, a coupled binary feature learning method for RGB-D action data with an associated constraint term is proposed. Because the traditional 3D LTP descriptor cannot capture the continuously changing spatio-temporal appearance of human motion, a method is first developed for computing 3D pixel-difference (and depth-difference) vectors from the pixel changes across multiple adjacent frames in local RGB and depth blocks. Based on these difference vectors, a coupled binary feature learning method with an associated loss term is then proposed to overcome the weak generalization and limited descriptive power of 3D binary features on depth images, greatly reducing the discrepancy between the binary features of the two modalities. Experiments on three RGB-D action datasets, each containing hundreds of samples, show that the global spatio-temporal texture features obtained from the local binary features with VLAD encoding achieve strong recognition performance for actions in fixed scenes with little intra-class variation.

(2) For feature representation, a joint multi-modal feature representation method with multi-graph constraints is proposed for RGB-D action recognition. Since RGB and depth data of the same action carry the same semantic information, a joint learning method factorizes both modalities into a common shared subspace through double non-negative matrix factorization. To capture the topological structure of the samples in each modality, two sparse graph construction methods are proposed using sparse representation models and graph similarity theory, and the corresponding sparse graph regularization constraints are added to the double factorization model. Extensive experiments on four datasets of different sizes verify that the joint multi-modal representation not only improves the recognition of complex human actions compared with single RGB or depth sequences, but can also be used to distinguish similar actions in RGB-D data.

(3) For action recognition, an RGB-D action recognition method based on a two-stream Siamese 3D CNN is proposed. To address intra-class differences and inter-class similarity both within and across modalities, a semantic metric over the deep features of the two modalities is learned using a Siamese 3D CNN with novel contrastive losses. To avoid constructing large numbers of training pairs and the long training time of the Siamese network under the plain contrastive loss, reference samples are introduced for each class and each modality, and two contrastive-center losses are developed based on the transitivity of distance relationships. Experiments on the NTU RGB+D dataset and two RGB-D gesture datasets show that the proposed Siamese 3D CNN can identify human actions under multiple variation factors within each action class (viewpoint, lighting, and background). Moreover, compared with the plain contrastive loss, the network trained with the developed contrastive-center losses has a large advantage in computational efficiency.

In summary, this thesis proposes multimodal feature extraction and representation methods for different conditions that improve both the efficiency of semantic correlation representation between RGB and depth data and the performance of RGB-D action recognition. The results are also a useful exploration of semantic consistency in multi-modal data and provide an important reference for semantic learning on cross-modal data.
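The pixel-difference-and-encoding pipeline of contribution (1) can be illustrated with a minimal sketch: temporal pixel differences are extracted from a local spatio-temporal block, binarized through a projection (here a simple sign threshold standing in for the learned coupled binary codes), and aggregated with VLAD. All function names, the thresholding rule, and the input shapes are illustrative assumptions, not the thesis's actual formulation.

```python
import numpy as np

def pixel_difference_vectors(block):
    """Per-pixel temporal differences across adjacent frames of a local
    spatio-temporal block of shape (T, H, W); returns (H*W, T-1)."""
    diffs = np.diff(block.astype(np.float32), axis=0)  # (T-1, H, W)
    return diffs.reshape(diffs.shape[0], -1).T

def binarize(features, projection):
    """Project difference vectors and threshold at zero to obtain binary
    codes (a stand-in for the learned coupled binary features)."""
    return (features @ projection > 0).astype(np.uint8)

def vlad_encode(codes, centers):
    """VLAD: sum residuals of each local code to its nearest center,
    then L2-normalize the concatenated residual vector."""
    codes = codes.astype(np.float32)
    dists = ((codes[:, None, :] - centers[None]) ** 2).sum(-1)
    assign = dists.argmin(1)
    vlad = np.zeros_like(centers, dtype=np.float32)
    for k in range(centers.shape[0]):
        if (assign == k).any():
            vlad[k] = (codes[assign == k] - centers[k]).sum(0)
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)
```

In the thesis the binary codes are learned jointly across RGB and depth with an associated loss; the fixed projection above only shows where such codes would enter the VLAD stage.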
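The double non-negative matrix factorization of contribution (2) can be sketched as follows: RGB features X1 and depth features X2 are factorized as X1 ≈ W1·H and X2 ≈ W2·H with a shared coefficient matrix H, plus a graph-Laplacian smoothness term on H, using GNMF-style multiplicative updates. The single similarity graph S and all hyperparameters are simplifying assumptions; the thesis uses two sparse graphs built from sparse representation and graph similarity theory.

```python
import numpy as np

def double_graph_nmf(X1, X2, S, k, iters=200, lam=0.1, eps=1e-9):
    """Joint factorization X1 ~ W1 @ H, X2 ~ W2 @ H with a shared H and
    a graph regularizer lam * tr(H L H^T), L = D - S, via multiplicative
    updates (nonnegativity is preserved because all factors stay >= 0)."""
    rng = np.random.default_rng(0)
    n = X1.shape[1]
    W1 = rng.random((X1.shape[0], k))
    W2 = rng.random((X2.shape[0], k))
    H = rng.random((k, n))
    D = np.diag(S.sum(axis=1))  # degree matrix of the sample graph
    for _ in range(iters):
        W1 *= (X1 @ H.T) / (W1 @ H @ H.T + eps)
        W2 *= (X2 @ H.T) / (W2 @ H @ H.T + eps)
        num = W1.T @ X1 + W2.T @ X2 + lam * (H @ S)
        den = (W1.T @ W1 + W2.T @ W2) @ H + lam * (H @ D) + eps
        H *= num / den
    return W1, W2, H
```

The columns of H serve as the joint RGB-D representation fed to a downstream classifier; with two graphs, one additional (S, D, lam) term per graph would be added to the numerator and denominator of the H update.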
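The idea behind the contrastive-center losses of contribution (3) is that each embedding is compared against per-class reference centers rather than against explicitly constructed sample pairs, which removes the need to enumerate pairs. A minimal, hypothetical form of such a loss (the exact formulation in the thesis may differ) is:

```python
import numpy as np

def contrastive_center_loss(feats, labels, centers, margin=1.0):
    """Pull each embedding toward its own class reference center and
    push it at least `margin` away from the nearest other-class center.
    No sample pairs are built: each sample interacts only with centers."""
    loss = 0.0
    for f, y in zip(feats, labels):
        d = np.linalg.norm(f - centers, axis=1)       # distance to each center
        pull = d[y] ** 2                              # attract to own center
        nearest_other = np.min(np.delete(d, y))       # closest wrong center
        loss += pull + max(0.0, margin - nearest_other) ** 2
    return loss / len(feats)
```

With C classes this costs O(N·C) distance computations per batch instead of the O(N²) pairs of a plain contrastive loss, which is the source of the efficiency gain described in the abstract; in the thesis, separate centers per modality let the loss also align RGB and depth embeddings.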
Keywords/Search Tags: RGB-D images, human action recognition, multi-modal semantic correlation, data fusion, coupled binary feature learning, non-negative matrix factorization, Siamese network, contrastive-center loss