
Dynamic Gesture Recognition Based On Spatio-temporal Feature Representation And Dictionary Optimization

Posted on: 2015-08-10    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J Wan    Full Text: PDF
GTID: 1228330467972163    Subject: Signal and Information Processing
Abstract/Summary:
Gesture recognition is an important branch of human-computer interaction. Owing to the RGB-D cameras (e.g., Kinect) launched by several companies, gesture recognition based on RGB-D data has gained considerable attention in recent years. This dissertation addresses two problems: how to extract spatio-temporal features and how to learn a compact and discriminative dictionary. The main contributions are summarized as follows:

1. For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation and offers a more compact and richer visual representation. For learning a discriminative model, all features extracted from the training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of visual words, which uses vector quantization (VQ) to map each feature to a single visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied, so each feature can be represented by a linear combination of a small number of codewords. Compared with VQ, SOMP yields a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on the ChaLearn gesture database, and the result ranked among the top-performing techniques in the ChaLearn gesture challenge (round 2).

2. We propose a spatio-temporal feature named three-dimensional sparse motion scale-invariant feature transform (3D SMoSIFT), computed from RGB-D data, for gesture recognition. First, we build pyramids as a scale space for each RGB and depth frame, and then use the Shi-Tomasi corner detector and sparse optical flow to quickly detect and track robust keypoints around the motion pattern in the scale space. Subsequently, local patches around the keypoints, extracted from the RGB-D data, are used to build 3D gradient and motion spaces, and SIFT-like descriptors are calculated on both 3D spaces. The proposed feature is invariant to scale and translation and robust to partial occlusions. More importantly, the feature is fast to compute, making it well suited for real-time applications. We have evaluated the proposed feature under a bag-of-words model on the ChaLearn Gesture Dataset. Experimental results show that the proposed feature outperforms other spatio-temporal features and is comparable to other state-of-the-art approaches, even though there is only one training sample per class.

3. We propose a novel approach called CSMMI, which aims to learn an optimal dictionary for each class. Unlike traditional dictionary-based algorithms, which typically learn a single dictionary shared by all classes, we unify intra-class and inter-class mutual information (MI) into one objective function to optimize a class-specific dictionary. The objective function has two aims: (1) maximizing the MI between the selected dictionary items and the remaining dictionary items within a specific class (intrinsic structure); and (2) minimizing the MI between the selected dictionary items of a specific class and the dictionary items of the other classes (extrinsic structure). Computational complexity is an issue for CSMMI, so an important contribution is a submodular method that significantly reduces it. Experimental results show that the proposed method outperforms shared-dictionary methods and is comparable to other state-of-the-art approaches.
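The VQ-versus-sparse-coding comparison in the first contribution can be illustrated with a minimal sketch. The plain orthogonal matching pursuit routine, codebook size, and random data below are illustrative assumptions, not the dissertation's actual SOMP implementation; the point is only that refitting a feature over a few codewords reconstructs it better than snapping it to one codeword.

```python
import numpy as np

def omp(D, x, n_nonzero=5):
    """Greedy orthogonal matching pursuit: approximate x as a sparse
    combination of the columns of codebook D (d x K, unit-normalized)."""
    residual = x.copy()
    support = []
    coef = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        # pick the codeword most correlated with the current residual
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # least-squares refit on all selected codewords
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    coef[support] = sol
    return coef

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))      # toy codebook (e.g. from k-means)
D /= np.linalg.norm(D, axis=0)          # unit-norm codewords
x = rng.standard_normal(64)             # one local feature descriptor

# hard vector quantization: a single codeword carries the feature
vq_idx = int(np.argmax(D.T @ x))
vq_err = np.linalg.norm(x - D[:, vq_idx] * (D[:, vq_idx] @ x))

# sparse coding: a few codewords, lower reconstruction error
code = omp(D, x, n_nonzero=5)
sc_err = np.linalg.norm(x - D @ code)
print(sc_err < vq_err)                  # sparse code reconstructs better
```

Each atom added by the greedy loop can only shrink the least-squares residual, which is why the sparse code's reconstruction error drops below the single-codeword VQ error.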
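The Shi-Tomasi detection step in the second contribution can also be sketched. The pure-NumPy structure-tensor computation, window size, and synthetic frame below are illustrative assumptions, not the dissertation's pipeline (which runs the detector with sparse optical flow over RGB-D scale-space pyramids).

```python
import numpy as np

def shi_tomasi_response(img, win=3):
    """Shi-Tomasi corner score: the smaller eigenvalue of the local
    structure tensor, accumulated over a win x win window."""
    Iy, Ix = np.gradient(img)                       # image gradients
    def box(a):                                     # windowed sum
        v = np.lib.stride_tricks.sliding_window_view(a, (win, win))
        return v.sum(axis=(-1, -2))
    Sxx, Syy, Sxy = box(Ix * Ix), box(Iy * Iy), box(Ix * Iy)
    trace = Sxx + Syy
    gap = np.sqrt((Sxx - Syy) ** 2 + 4 * Sxy ** 2)
    return 0.5 * (trace - gap)                      # min eigenvalue

# synthetic frame with one bright square; its corner should score high
img = np.zeros((20, 20))
img[8:, 8:] = 1.0
resp = shi_tomasi_response(img)
# response at the square's corner exceeds the flat background
print(resp[7, 7] > resp[1, 1])
```

Edges excite only one eigenvalue of the structure tensor while corners excite both, so thresholding the smaller eigenvalue keeps exactly the points that are trackable in two directions, which is what makes these keypoints robust for the sparse optical flow stage.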
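The class-specific selection idea behind CSMMI can be sketched as a greedy loop. The correlation-based score below is a crude stand-in for the mutual-information objective, and the matrix sizes are made up for illustration; it is not the dissertation's submodular algorithm, only the intrinsic-versus-extrinsic trade-off it optimizes.

```python
import numpy as np

def greedy_select(D_class, D_others, k=4, lam=1.0):
    """Greedily pick k dictionary items for one class, scoring each
    candidate by its similarity to the rest of its own class's items
    (intrinsic structure) minus its similarity to other classes' items
    (extrinsic structure) -- a toy surrogate for the MI objective."""
    K = D_class.shape[1]
    G_intra = np.abs(D_class.T @ D_class)    # within-class similarity
    G_inter = np.abs(D_class.T @ D_others)   # cross-class similarity
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for j in range(K):
            if j in selected:
                continue
            rest = [i for i in range(K) if i != j and i not in selected]
            score = G_intra[j, rest].mean() - lam * G_inter[j].mean()
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

rng = np.random.default_rng(1)
D_class = rng.standard_normal((32, 10))      # this class's candidate items
D_others = rng.standard_normal((32, 40))     # all other classes' items
picked = greedy_select(D_class, D_others, k=4)
print(sorted(picked))                        # 4 distinct item indices
```

The real method replaces this ad-hoc score with mutual information and exploits submodularity so the greedy selection comes with an approximation guarantee at much lower cost than exhaustive search.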
Keywords/Search Tags: spatio-temporal feature extraction, bag of visual words, sparse coding, dictionary learning, RGB-D data, 3D EMoSIFT, 3D SMoSIFT, CSMMI, Kinect