
Research On Computer Vision Based Human Action Recognition Technology

Posted on: 2016-08-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N J Li
Full Text: PDF
GTID: 1108330503477870
Subject: Information and Communication Engineering
Abstract/Summary:
Computer vision based human action recognition has wide applications in video surveillance, video retrieval, human-computer interaction, etc., and has been a research focus of the computer vision community in recent years. Thanks to decades of devotion by researchers at home and abroad, the area has achieved remarkable progress. However, due to unsatisfactory imaging conditions and large intra-class variance, many problems in human action recognition remain to be solved, e.g. the lack of a unified and effective feature description and model representation, the "semantic gap" between low-level image features and high-level semantics, and the need for efficient machine learning methods. Aiming at these difficulties, this dissertation makes improvements and innovations mainly in the aspects of feature extraction and machine learning, proposing several valuable solutions. The main contributions are summarized as follows.

1) Multi-feature fusion and hierarchical BP-AdaBoost based action recognition. Most existing works adopt the support vector machine (SVM) as the discriminative classifier, whereas the artificial neural network (ANN) is rarely used. To explore the performance of ANNs in action recognition, a hierarchical BP-AdaBoost based recognition system is designed: the standard binary AdaBoost algorithm is extended to a multi-class version; to further reduce training complexity and confusion, a hierarchical recognition framework with pre-decision and post-decision modules is proposed; and to exploit the complementarity of features, the system combines multiple motion and shape features.
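The pre-decision/post-decision routing idea can be sketched as follows. This is a hypothetical toy illustration, not the dissertation's implementation: the feature names, threshold rules, and action groups are invented stand-ins for the BP-AdaBoost ensembles that the system actually trains at each stage.

```python
# Hypothetical sketch of a hierarchical recognition framework: a
# pre-decision classifier routes a sample to a coarse action group,
# then a per-group classifier resolves the final label. The toy
# threshold rules below stand in for trained BP-AdaBoost ensembles.

def pre_decision(feat):
    """Route to a coarse group, e.g. leg-dominated vs arm-dominated motion."""
    if feat["motion_energy_lower"] > feat["motion_energy_upper"]:
        return "leg"
    return "arm"

# Post-decision stage: one fine-grained classifier per coarse group.
GROUP_CLASSIFIERS = {
    "leg": lambda f: "run" if f["speed"] > 0.5 else "walk",
    "arm": lambda f: "wave" if f["speed"] > 0.5 else "clap",
}

def recognize(feat):
    group = pre_decision(feat)              # pre-decision module
    return GROUP_CLASSIFIERS[group](feat)   # post-decision module

sample = {"motion_energy_lower": 0.8, "motion_energy_upper": 0.2, "speed": 0.9}
print(recognize(sample))  # → run
```

Because each fine classifier only discriminates within its own group, confusions between dissimilar actions are resolved cheaply at the coarse stage, which is the training-cost reduction the framework aims at.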
Experimental results demonstrate the merits of ANN over SVM in training time and recognition accuracy, and show that the hierarchical framework largely reduces training cost and confusion among actions, considerably enhancing the recognition rate.

2) Fast HOG3D and self-organizing feature map (SOM) based action recognition. Early action recognition literature used benchmark datasets with simple backgrounds; nowadays many realistic action datasets contain cluttered backgrounds, and recognition rates on them are generally low. The dissertation employs the currently most prevalent spatio-temporal interest points (STIP) as local features and proposes a new recognition framework that combines Fast HOG3D and SOM, constructing a more compact and computationally efficient local descriptor than the original HOG3D. Furthermore, the dissertation successfully applies SOM to realistic action recognition and closely studies the influence of its training parameters. Experiments show that Fast HOG3D enhances computational efficiency while largely preserving the discriminative power of the original HOG3D, and that SOM is comparable with bag-of-words (BoW) and clearly outperforms local-feature-based SVM in both recognition rate and robustness to label noise.

3) Huffman coding and implicit action model (IAM) based action recognition. To exploit the complementarity among multiple features, 5 channels of descriptors are adopted to describe the STIPs, and a codebook for each channel is generated through hierarchical agglomerative clustering. The common BoW plus SVM recognition scheme totally ignores the context among local features, so much useful information is lost and recognition performance is limited.
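The codebook-based BoW representation criticized above can be sketched in a few lines. This is a minimal toy sketch with invented 2-D descriptors and a two-word codebook, not the dissertation's multi-channel pipeline: each local descriptor is quantized to its nearest codeword and the video becomes a normalized histogram of codeword counts, which indeed discards all spatio-temporal context among the descriptors.

```python
# Minimal bag-of-words quantization sketch (hypothetical toy data):
# assign each local descriptor to its nearest codebook entry and
# represent the video by the normalized histogram of codeword counts.

def nearest_codeword(desc, codebook):
    # squared Euclidean distance to each codeword
    dists = [sum((d - c) ** 2 for d, c in zip(desc, word)) for word in codebook]
    return dists.index(min(dists))

def bow_histogram(descriptors, codebook):
    hist = [0.0] * len(codebook)
    for desc in descriptors:
        hist[nearest_codeword(desc, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

codebook = [(0.0, 0.0), (1.0, 1.0)]                            # two codewords
descriptors = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (0.0, 0.2)]  # one video
print(bow_histogram(descriptors, codebook))  # → [0.5, 0.5]
```

Note that any reordering or spatial rearrangement of the descriptors yields the same histogram, which is exactly the context loss the IAM based method is designed to address.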
The IAM based action recognition method proposed in this dissertation performs markedly better because it takes into account the spatio-temporal statistical relationships between STIPs and center interest points. On the other hand, Huffman coding ignores minor probability differences while preserving large ones, and is therefore more tolerant of errors in the estimated probabilities of visual words; it thus outperforms the Naive Bayes (NB) method, which uses the conditional probabilities of visual words directly. In addition, mechanisms such as "hierarchical codebooks", "sparse coding", and "feature fusion" are integrated to further enhance overall performance. Experiments confirm the effectiveness of Huffman coding and IAM based action recognition, and show that the recognition system integrating both methods with multiple features outperforms other methods in the literature.

4) Random forest and spatio-temporal correlation based action recognition. To better handle interaction recognition, the spatio-temporal constraints among local features must be fully discovered and exploited. This part of the work again takes STIPs as low-level features, from which two kinds of mid-level features are generated: motion context (MC) and STIP co-occurrence sequences. MC skips STIP description and directly observes the spatial distribution of STIPs to form histograms, which are used to train a random forest. A genetic algorithm (GA) is applied to the training of decision trees for the first time and proves to be a good compromise between performance and efficiency. STIP co-occurrence sequences capture the temporal co-occurrence of local features and can be fed into biological sequence matching algorithms to compute the temporal correlation between videos, whereas the spatial correlation between videos is calculated with an MC and STIP codebook based histogram intersection kernel.
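The histogram intersection kernel used for the spatial correlation above is a standard similarity measure and can be sketched directly; the two toy histograms below are invented for illustration, not taken from the dissertation's data.

```python
# Histogram intersection kernel sketch (toy normalized histograms):
# the kernel value is the sum of bin-wise minima, i.e. the overlap
# between two codeword histograms; identical histograms score 1.0.

def histogram_intersection(h1, h2):
    return sum(min(a, b) for a, b in zip(h1, h2))

h_video1 = [0.5, 0.3, 0.2]   # codeword histogram of video 1
h_video2 = [0.4, 0.4, 0.2]   # codeword histogram of video 2
print(histogram_intersection(h_video1, h_video2))  # ≈ 0.9
```

Unlike the Euclidean distance, the intersection kernel only rewards shared mass per bin, which makes it robust to a few bins being inflated by background clutter.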
Experimental results show that, using the same codebook, the spatio-temporal correlation based method outperforms BoW and pLSA in both performance and efficiency.

5) Multichannel trajectory features and data mining (DM) based action recognition. Existing works have already demonstrated the advantages of trajectories, which are higher-level than STIPs because they contain the spatio-temporal contextual information of STIPs. The dissertation employs an optical flow based method and combines dense sampling with STIP detection to extract multi-scale trajectories. The trajectories are described by 6 channels of local features plus 12 channels of global features and then clustered; DM and K-nearest-neighbor (KNN) classification are applied to local and global feature based recognition, respectively. Action datasets are modeled as "transaction databases", and DM is adopted to mine frequent trajectories and frequent trajectory clusters, which serve as action models. Experiments show that DM is more effective than SVM and is comparable with BoW and KNN based classification, and that, compared with other works, feature fusion based DM achieves state-of-the-art performance.
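The "transaction database" view above can be illustrated with a brute-force frequent-itemset sketch. This is a hypothetical toy example, not the dissertation's mining algorithm: each video is treated as a transaction of trajectory-cluster ids (the `c1`..`c3` labels are invented), and cluster sets whose support exceeds a threshold serve as a toy action model; a real system would use Apriori-style pruning rather than enumerating all combinations.

```python
from itertools import combinations

# Brute-force frequent-itemset sketch over a toy "transaction database":
# each transaction is the set of trajectory-cluster ids observed in one
# video; an itemset is frequent if its support (fraction of transactions
# containing it) reaches min_support.

def frequent_itemsets(transactions, min_support, max_size=2):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            support = sum(set(itemset) <= t for t in transactions) / n
            if support >= min_support:
                frequent[itemset] = support
    return frequent

videos = [{"c1", "c2"}, {"c1", "c2", "c3"}, {"c1", "c3"}, {"c2", "c3"}]
print(frequent_itemsets(videos, min_support=0.5))
```

A new video can then be scored by how many of a class's frequent cluster sets it contains, which is the sense in which the mined patterns act as an action model.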
Keywords/Search Tags: computer vision, action recognition, feature extraction, feature fusion, machine learning, data mining