Human target tracking and action recognition have become research hotspots in computer vision, and they play a critical role in intelligent surveillance, virtual reality, smart homes, and human-computer interaction. Researchers have proposed many solutions to these problems. However, owing to the non-rigid, asymmetric, and polymorphic nature of the human body, as well as interference such as occlusion, illumination variation, and scale change in complex environments, robust human target tracking, human feature extraction, and action recognition remain difficult. Therefore, this thesis studies robust visual tracking and RGB-D multimodal human action recognition. On the premise of accurately locating and tracking the human target, multimodal data including RGB videos, depth information, and human skeleton sequences are fused to accomplish the task of human action recognition. The main contributions of this thesis can be summarized as follows:

(1) This thesis proposes a robust target tracking algorithm based on a collaborative model with an adaptive selection scheme. Specifically, based on discriminative features extracted from positive and negative template sets with a feature selection scheme, a sparse discriminative model (SDM) is developed by introducing a confidence measure strategy. In addition, a sparse generative model (SGM) is presented by combining ℓ1 regularization with PCA reconstruction, which handles outliers effectively and has strong representation power. To overcome the deficiency of the traditional multiplicative fusion mechanism, this thesis proposes an adaptive selection scheme based on Euclidean distance, which detects the degraded model during dynamic tracking and adopts corresponding strategies to construct a more reliable likelihood function. The proposed SDM and SGM are integrated into a Bayesian inference framework through this adaptive selection scheme. Furthermore, the template sets and the PCA subspace are updated with different schemes to alleviate the drift problem and strengthen the algorithm's ability to handle appearance changes in dynamic environments. Quantitative and qualitative evaluations validate that the proposed method achieves more robust performance than several state-of-the-art algorithms.
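To make the adaptive selection idea concrete, the following Python sketch shows one way the per-candidate confidences of the SDM and SGM could be fused. The degradation test (Euclidean distance between a model's current confidences and its confidences on recent reliable frames) and the threshold are assumptions introduced for illustration, not the exact criterion used in the thesis.

```python
import numpy as np

def fuse_confidences(conf_sdm, conf_sgm, hist_sdm, hist_sgm, tau=0.5):
    """Hypothetical adaptive selection between SDM and SGM confidences.

    conf_sdm, conf_sgm : per-candidate confidence vectors from the
        discriminative and generative models (one entry per particle).
    hist_sdm, hist_sgm : confidence vectors from recent reliable frames,
        used as references for detecting model degradation (assumed).
    tau : degradation threshold (assumed; not specified in the abstract).
    """
    # Euclidean distance of each model's current confidences to its history;
    # a large distance is taken as a sign that the model has degraded.
    d_sdm = np.linalg.norm(conf_sdm - hist_sdm)
    d_sgm = np.linalg.norm(conf_sgm - hist_sgm)

    if d_sdm > tau and d_sgm <= tau:   # discriminative model degraded
        return conf_sgm
    if d_sgm > tau and d_sdm <= tau:   # generative model degraded
        return conf_sdm
    # Both models reliable (or both degraded): fall back to the
    # multiplicative fusion used by traditional collaborative models.
    return conf_sdm * conf_sgm

# The fused scores act as the observation likelihood inside a particle
# filter; the candidate with the highest score is taken as the target.
```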
(2) This thesis proposes a multimodal correlative representation learning (MCRL) model for RGB-D human action recognition. In the feature extraction stage, a robust spatio-temporal pyramid feature (RSTPF) is proposed to capture dynamic local patterns around each human joint from RGB and depth data; the descriptor integrates both spatial arrangement and temporal structure. To learn more compact and discriminative shared semantic features, a linear projection matrix is introduced for each modality to map the original low-level features onto a low-dimensional subspace, and a quantization matrix is then used to encode the components shared by all the projected features. Subspace learning and shared-feature mining are integrated into a supervised multi-task learning framework that jointly learns the low-dimensional subspace and the shared features, and an iterative optimization method is presented to obtain the optimal model parameters. Furthermore, by introducing a weight regularization matrix, an improved collaborative representation classifier (ICRC) is employed to perform computationally efficient action recognition. Experimental results on four RGB-D action datasets demonstrate the effectiveness of the proposed method.

(3) This thesis proposes a collaborative multimodal feature learning (CMFL) model for human action recognition from RGB-D sequences. The CMFL model uses supervised matrix factorization to decompose the three modalities into shared features, which uncover their latent connections, and modality-specific features, which describe the intrinsic and unique characteristics of each modality. The shared and modality-specific features complement each other and yield more discriminative semantic features. Shared-specific feature mining and action classifier learning are integrated into a unified max-margin framework, which adapts the feature learning to the classification task. An iterative optimization method is presented to solve the CMFL model and obtain the optimal parameters. Extensive evaluations on four RGB-D action datasets validate that the proposed method achieves better performance than several state-of-the-art algorithms. Moreover, the experimental results show that the CMFL model can transfer useful prior knowledge from training samples to testing samples, so the proposed method performs well even when one or two modalities are unavailable in the testing stage.
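As an illustration of the collaborative representation classification used in contribution (2), the following Python sketch solves the regularized least-squares coding problem in closed form and assigns the class with the smallest reconstruction residual. The distance-based diagonal weight matrix is an assumption made for illustration; the thesis defines the actual weight regularization matrix of the ICRC.

```python
import numpy as np

def icrc_predict(X_train, y_train, x_test, lam=0.01):
    """Hypothetical collaborative representation classifier with a diagonal
    weight regularization matrix.

    X_train : (d, n) matrix whose columns are training features.
    y_train : (n,) integer class labels.
    x_test  : (d,) test feature vector.
    """
    # Assumed weighting: training samples far from the test sample are
    # penalized more strongly, so they contribute less to the code.
    w = np.linalg.norm(X_train - x_test[:, None], axis=0)
    Gamma = np.diag(w)

    # Regularized least-squares code over all training samples
    # (collaborative representation), solved in closed form.
    A = X_train.T @ X_train + lam * (Gamma.T @ Gamma)
    alpha = np.linalg.solve(A, X_train.T @ x_test)

    # Classify by the class whose samples reconstruct the test vector best.
    best_class, best_residual = None, np.inf
    for c in np.unique(y_train):
        idx = (y_train == c)
        residual = np.linalg.norm(x_test - X_train[:, idx] @ alpha[idx])
        residual /= (np.linalg.norm(alpha[idx]) + 1e-12)
        if residual < best_residual:
            best_class, best_residual = c, residual
    return best_class
```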
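The shared/specific decomposition at the heart of the CMFL model in contribution (3) can be illustrated by a simplified alternating least-squares sketch: each modality is factorized against a dictionary whose codes split into a block shared across modalities and a modality-specific block. The factorization form, the update rules, and all parameter names are assumptions introduced for illustration, and the supervised max-margin classifier term of the actual model is omitted.

```python
import numpy as np

def cmfl_sketch(X, k_shared=20, k_spec=10, lam=0.1, n_iter=30, seed=0):
    """Simplified shared/specific factorization by alternating ridge-
    regularized least squares.  X is a dict {modality: (d_m, n) matrix};
    all modalities must describe the same n samples."""
    rng = np.random.default_rng(seed)
    n = next(iter(X.values())).shape[1]
    Z = rng.standard_normal((k_shared, n))                 # shared codes
    S = {m: rng.standard_normal((k_spec, n)) for m in X}   # specific codes
    D = {m: rng.standard_normal((Xm.shape[0], k_shared + k_spec))
         for m, Xm in X.items()}                           # dictionaries

    I_all = np.eye(k_shared + k_spec)
    for _ in range(n_iter):
        # 1) update each modality's dictionary
        for m, Xm in X.items():
            C = np.vstack([Z, S[m]])
            D[m] = Xm @ C.T @ np.linalg.inv(C @ C.T + lam * I_all)
        # 2) update modality-specific codes
        for m, Xm in X.items():
            Dz, Ds = D[m][:, :k_shared], D[m][:, k_shared:]
            R = Xm - Dz @ Z        # residual left to the specific block
            S[m] = np.linalg.solve(Ds.T @ Ds + lam * np.eye(k_spec), Ds.T @ R)
        # 3) update the shared codes jointly over all modalities
        A = lam * np.eye(k_shared)
        B = np.zeros((k_shared, n))
        for m, Xm in X.items():
            Dz, Ds = D[m][:, :k_shared], D[m][:, k_shared:]
            A += Dz.T @ Dz
            B += Dz.T @ (Xm - Ds @ S[m])
        Z = np.linalg.solve(A, B)
    return Z, S, D
```

Because the shared codes Z are estimated jointly from whichever modalities are present, a sketch of this form also suggests why the learned model can still be applied when one or two modalities are missing at test time.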