
Human Pose Estimation And Action Recognition From Image Sequences

Posted on: 2011-10-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X X Wu
Full Text: PDF
GTID: 1118360308455601
Subject: Computer application technology
Abstract/Summary:
Human action analysis and recognition is a highly active research area in computer vision and pattern recognition. It has many promising applications, including human-computer interaction, intelligent surveillance, virtual reality and motion analysis. In this thesis, we focus on 3D human pose estimation and action recognition from image sequences. We mainly address the high dimensionality of the pose space and the ambiguity in human pose estimation, as well as feature representation and classifier design in action recognition.

A novel manifold learning method, called temporal neighbor preserving embedding (TNPE), is proposed to learn the low-dimensional intrinsic manifold of human motion within a learning-based framework for 3D human pose estimation. It alleviates the problem of high dimensionality in both the image feature space and the 3D pose space by exploiting the strong constraints inherent in natural human motion. A Bayesian mixture of experts (BME) is employed to establish the nonlinear mapping from the low-dimensional space to the high-dimensional pose space, with each expert handling a linear mapping in a local region. To calculate the gating of each expert, a Gaussian mixture model (GMM) is used to approximate the probability distribution over the manifold space, which yields the prior probabilities and distribution models of the experts. The experimental results on 3D hand and body pose estimation show encouraging performance in both stability and accuracy.

To alleviate the ambiguities caused by perspective projection from the 3D scene onto the 2D image plane, a novel framework based on semantic feedback for 3D human pose estimation is presented, which incorporates high-level motion knowledge to guide the pose estimation. A global temporal motion template is built to capture the temporal coherence between time-ordered poses. Local spatial motion correlations are created to preserve the nonlinear relationships between different body parts. The semantic knowledge is represented by both the temporal motion template and the spatial motion correlations, and is used to rule out implausible pose hypotheses and yield more accurate estimates. Experiments on the CMU Mocap database demonstrate that our method achieves better estimation accuracy than other methods without semantic feedback.

A novel incremental learning method, namely Incremental Discriminant-Analysis of Canonical Correlations (IDCC), is proposed and applied to action recognition. It uses a discriminant matrix to project all the training actions into a new space, where the canonical correlations of actions within the same class are maximized and those of actions from different classes are minimized. To capture the large changes in human appearance across various complex scenarios, the discriminant matrix of IDCC is incrementally updated with new training data, which facilitates the recognition task in changing environments. Experiments on both regular and irregular action datasets demonstrate that the proposed method recognizes human actions with high accuracy and robustness in various non-stationary scenarios.

A novel action descriptor based on spatio-temporal interest points is proposed for action recognition. It is represented by multiple bags of spatio-temporal distribution words, which capture the spatio-temporal relationships between interest points over multiple local regions of different space-time scales in a video. A bag of appearance words is employed to capture the appearance information of interest points. The multiple bags of distribution words and the bag of appearance words characterize, respectively, the "where" and the "what" of the interest points. A multiple kernel learning method is introduced to adaptively combine these two features into a more descriptive and discriminative feature for recognition. The proposed method does not require any pre-processing of the action video, such as object detection or human body tracking, and is robust to noise, camera movement and low-resolution video. Experiments on both single-view and multiple-view datasets show the effectiveness and robustness of the recognition.
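
For readers unfamiliar with the canonical correlations that IDCC builds on, the following is a minimal sketch of how they can be computed between two sets of feature vectors (for example, the frames of two action sequences). It is not the thesis's implementation: the function name `canonical_correlations`, the subspace dimension `dim`, and the column-per-frame data layout are assumptions made for illustration, and the discriminant projection and its incremental update are not shown. The sketch relies only on the standard result that canonical correlations between two linear subspaces are the singular values of the product of their orthonormal bases.

```python
import numpy as np


def canonical_correlations(X1, X2, dim):
    """Top `dim` canonical correlations between two sets of vectors.

    X1, X2: arrays of shape (n_features, n_samples), one sample per column
            (hypothetical layout chosen for this sketch).
    dim:    dimension of the subspace used to represent each set; must not
            exceed min(n_features, n_samples) of either set.
    """
    # Orthonormal basis of each set's leading subspace via a thin SVD.
    U1, _, _ = np.linalg.svd(X1, full_matrices=False)
    U2, _, _ = np.linalg.svd(X2, full_matrices=False)
    P1, P2 = U1[:, :dim], U2[:, :dim]

    # The singular values of P1^T P2 are the cosines of the principal
    # angles between the two subspaces, i.e. the canonical correlations.
    s = np.linalg.svd(P1.T @ P2, compute_uv=False)

    # Guard against values slightly outside [0, 1] from rounding error.
    return np.clip(s, 0.0, 1.0)


# Small demonstration on random data (values are illustrative only).
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))  # 10 samples with 30 features each
B = rng.standard_normal((30, 12))  # 12 samples with 30 features each
print(canonical_correlations(A, B, dim=4))
```

In a discriminant setting such as the one described above, both sets would first be projected by the learned matrix before the correlations are computed, and a similarity score could then be taken as, for instance, the sum of the correlations.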
Keywords/Search Tags: human pose estimation, action recognition, manifold learning, semantic feedback, incremental discriminant learning, spatio-temporal interest points