
Research Based On Local Spatiotemporal Features And Parts For Human Action Recognition From Videos

Posted on: 2016-09-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Li
GTID: 1108330479985504
Subject: Instrument Science and Technology
Abstract/Summary:
Human action recognition is an active research topic in computer vision. It aims to have computers analyze and understand videos automatically by extracting information related to human actions. It has many potential applications, such as intelligent video surveillance, content-based video analysis, intelligent monitoring, and human-computer interaction. In recent years, great progress has been made in human action recognition. However, because human actions are complex and varied, existing methods still have shortcomings, and researchers continue to explore techniques for automatic human action recognition. Because action recognition is a classification problem, video feature extraction and representation play an important role in recognizing human actions effectively. Focusing on feature extraction and representation, this dissertation analyzes existing methods and proposes several new ones. The main contents and contributions are summarized as follows:

① An action representation based on contextual structural information is proposed. The bag-of-words model quantizes each local feature to its nearest visual word, which results in large quantization error. Moreover, the bag-of-words model for human action recognition relies on global statistics of visual words and neglects structural information about spatiotemporal interest points. To overcome these shortcomings, a posterior probability coding framework is given and the typical coding methods are analyzed within it. Based on this framework, a new posterior probability coding method is proposed. It encodes local features by considering not only the spatial similarity between visual words and local features but also the linear similarity between them, which better captures local manifold information.
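The contrast between nearest-word quantization and probability-based coding can be sketched as follows. This is a minimal illustration, not the dissertation's method: it uses a generic Gaussian-kernel soft assignment as a stand-in for posterior probability coding, and the function names, the toy codebook, and the smoothing parameter `beta` are all assumptions introduced here.

```python
import math


def squared_distance(x, w):
    """Squared Euclidean distance between a local feature and a visual word."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def hard_code(feature, codebook):
    """Bag-of-words coding: all mass goes to the single nearest visual word."""
    dists = [squared_distance(feature, w) for w in codebook]
    nearest = dists.index(min(dists))
    return [1.0 if k == nearest else 0.0 for k in range(len(codebook))]


def soft_code(feature, codebook, beta=1.0):
    """Posterior-style coding: p(word k | feature) from a Gaussian kernel.

    beta is a hypothetical smoothing parameter, not a value from the source.
    """
    dists = [squared_distance(feature, w) for w in codebook]
    d_min = min(dists)  # shift by the minimum for numerical stability
    weights = [math.exp(-beta * (d - d_min)) for d in dists]
    total = sum(weights)
    return [wt / total for wt in weights]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy visual words (assumed)
feature = [0.45, 0.1]  # lies between word 0 and word 1
hard = hard_code(feature, codebook)
soft = soft_code(feature, codebook, beta=4.0)
```

Here `hard` puts the whole feature on word 0 and discards the fact that word 1 is almost as close, which is the quantization error described above. `soft` spreads the mass over all three words in order of proximity, so that information is retained.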
Based on the proposed coding method, a feature called the cumulative probability histogram is computed around each spatiotemporal interest point. It describes the spatiotemporal structural information of an interest point by the spatial and temporal order distribution of the interest points located in its context. Experimental results on multiple benchmarks show that, as a complement to local features, the cumulative probability histogram improves action recognition performance and is robust.

② A discriminative kernel dictionary-learning framework is proposed. Dictionary learning based on local features for human action recognition consists of three independent steps: dictionary learning, feature coding, and pooling. Treating them independently neglects how the steps affect one another, so the learned coding coefficients are not discriminative. Moreover, traditional dictionary learning methods are trained in linear space and have difficulty processing data with nonlinearities. To overcome these shortcomings, the proposed discriminative dictionary-learning framework fuses the three steps into a unified framework. Optimizing them simultaneously alleviates their mutual impact and makes the coding coefficients more discriminative. A linear classifier is learned jointly at the same time. A kernel version is provided by employing the double dictionary model, which nonlinearly maps features into a high-dimensional feature space and so improves the ability to process nonlinear data. Experimental results show its effectiveness.

③ An action representation method based on changes of coding coefficients between video frames is proposed. It rests on the observation that the motion information in a video can be described by the changes of coding coefficients across video frames, and that statistics on these changes effectively capture the motion information, especially temporal change information.
By dividing a video into multiple cells with a spatial pyramid model, the motion information of a cell can be captured by statistics on the decrease and increase of coding coefficients between frames. The resulting histogram is then fed into a support vector machine with a spatial pyramid matching kernel for final classification. Compared with other action recognition features, the proposed histogram feature is robust, very easy to compute, and independent of the coding method.

④ A new discriminative model based on the latent support vector machine is proposed. Videos are represented using dense spatiotemporal parts, and it can be observed that actions can be separated from each other by a set of discriminative spatiotemporal parts. With spatiotemporal parts defined as latent variables, the proposed model automatically learns and selects a set of discriminative spatiotemporal part detectors by introducing a group sparse regularizer into the latent support vector machine; non-discriminative part detectors are automatically removed. Incoherence constraints are employed to avoid redundant discriminative part detectors within the same class. Furthermore, the similarity of latent variables is used to force the detected spatiotemporal parts within the same class to be more similar and coherent. An iterative optimization method is proposed to compute the similarity-constrained latent variables quickly. Visualization of the detected spatiotemporal parts demonstrates that they are discriminative. Experimental results show that the proposed model achieves better recognition performance.
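The coefficient-change statistics used in the third contribution can be illustrated with a small sketch. It is an assumption-laden toy, not the dissertation's exact feature: it supposes one pooled coding vector per frame for a single cell, accumulates the total increase and total decrease of each coefficient between consecutive frames, and leaves out the spatial pyramid and the support vector machine. The function name and the sample data are hypothetical.

```python
def change_histogram(frame_codes):
    """Accumulate per-word increases and decreases between consecutive frames.

    frame_codes: list of per-frame coding vectors for one cell (assumed to be
    already pooled to one vector per frame). Returns a vector of length 2 * K:
    the first K entries are total increases, the last K are total decreases.
    """
    num_words = len(frame_codes[0])
    increase = [0.0] * num_words
    decrease = [0.0] * num_words
    for prev, curr in zip(frame_codes, frame_codes[1:]):
        for k in range(num_words):
            delta = curr[k] - prev[k]
            if delta > 0:
                increase[k] += delta
            else:
                decrease[k] += -delta
    return increase + decrease


# Three frames, three visual words: mass moves from word 0 toward word 2.
frames = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.25, 0.75],
]
hist = change_histogram(frames)
```

For this toy input, `hist` is `[0.0, 0.0, 0.75, 0.5, 0.25, 0.0]`: word 2 only gains, while words 0 and 1 only lose. Because the function sees nothing but the coefficient vectors, the same statistic can be computed on the output of any coding method, which matches the independence claim above. In a full pipeline one such histogram per pyramid cell would be concatenated before classification.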
Keywords/Search Tags: Human action recognition, context, discriminative dictionary learning, feature coding, latent support vector machine