
Multi-shot And Zero-shot Learning For Human Action Recognition

Posted on: 2019-02-06    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Tian    Full Text: PDF
GTID: 1368330551458109    Subject: Signal and Information Processing
Abstract/Summary:
Human action recognition is of great importance in computer vision: it has significant theoretical value and many potential applications, and the Chinese government has recently attached great importance to it. This dissertation focuses on human action recognition for social security, a topic arising from a National Key Basic Research Program of China. Social videos have two important properties: on the one hand, they contain realistic environments and complex human actions; on the other hand, they involve huge numbers of unlabeled videos and newly emerging unseen actions. Based on these two properties, this dissertation mainly studies two problems: multi-shot action recognition in realistic conditions and zero-shot action recognition in unlabeled videos.

First, for multi-shot action recognition in realistic conditions, a key goal of this dissertation is to integrate video sequences into robust and discriminative representations that fully exploit the information carried by local features. Building on spatio-temporal local features and the bag-of-words (BoW) model, we propose two sparse-coding-based methods for multi-shot human action recognition that improve on existing methods. The main contributions are as follows.

(1) Context and locality constrained linear coding for multi-shot action recognition. A sparse coding method, context and locality constrained linear coding (CLLC), is proposed to encode local features, and a group-wise sparse representation based classification method (GSRC) is then applied to classify test videos from their sparse codes. Through its locality and context constraints, CLLC fully exploits the local correlation and contextual information of spatio-temporal features. It overcomes two typical limitations of existing methods, namely large quantization error and loss of contextual information, and thereby achieves better recognition performance.

(2) Hierarchical and spatio-temporal sparse representation for multi-shot action recognition. A hierarchical encoding method is proposed. In the first layer, a locally consistent group sparse coding (LCGSC) method encodes the local features within each video, capturing their global and local correlations through group sparsity and locally consistent constraints, and two location estimation methods (absolute and relative) characterize the spatio-temporal layout of the local features. In the second layer, LCGSC encodes the videos belonging to the same action at different levels of abstraction, producing much more discriminative video representations. The proposed method addresses the main limitations of existing methods, including large quantization error, loss of contextual information, independent encoding, unordered encoding and single-layer encoding, and further improves the recognition accuracy of complex human actions.
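To make the encoding stage concrete, the following is a minimal Python/NumPy sketch of plain locality-constrained linear coding over a BoW codebook, followed by max pooling into a video-level vector. It only illustrates the locality constraint that CLLC builds on; the function names and parameters are illustrative, and the context constraint, LCGSC and the GSRC classifier described above are not included.

import numpy as np

def llc_encode(X, B, k=5, reg=1e-4):
    """Locality-constrained linear coding (approximated LLC).
    X: (n, d) local spatio-temporal descriptors; B: (m, d) BoW codebook.
    Each descriptor is reconstructed from its k nearest codewords,
    with the codes constrained to sum to one."""
    codes = np.zeros((X.shape[0], B.shape[0]))
    for i, x in enumerate(X):
        dist = np.linalg.norm(B - x, axis=1)        # distance to every codeword
        idx = np.argsort(dist)[:k]                  # k nearest codewords (locality)
        z = B[idx] - x                              # shift the neighbourhood to the descriptor
        C = z @ z.T + reg * np.eye(k)               # regularised local covariance
        w = np.linalg.solve(C, np.ones(k))
        codes[i, idx] = w / w.sum()                 # enforce the sum-to-one constraint
    return codes

def video_representation(X, B, k=5):
    """Max-pool the codes of all local features into one video-level vector."""
    return llc_encode(X, B, k).max(axis=0)

In the dissertation, this basic scheme is extended with a context constraint on the codes, and the resulting sparse codes are classified with GSRC.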
Second, for zero-shot action recognition in unlabeled videos, we aim to learn an appropriate visual-to-semantic mapping that projects unseen videos into a suitable semantic space. Two temporal dynamic-preserving methods for zero-shot action recognition are proposed, which accurately predict newly emerging unseen human actions. The main contributions are as follows.

(1) Max-margin structural regression for zero-shot action recognition. We build a max-margin structural SVM model to learn a discriminant function that maximizes the compatibility between video sequences and their corresponding semantic representations. Global and local sub-models are designed to ensure reliable classification and to capture the temporal information of video sequences. The proposed method overcomes the limitation of traditional zero-shot learning methods, which neglect the temporal information of video sequences, and thereby achieves better performance.

(2) Aligned dynamic-preserving embedding for zero-shot action recognition. We learn linear visual-to-semantic mappings for the source and target domains separately. A temporal factor is first applied to capture the temporal information of video sequences, and an adaptive embedding of the target videos is then learnt to explore the underlying distributions of both the source and target data. An aligned regularization term aligns the semantic representations of target videos with their corresponding class prototypes, taking variations across action categories into account. The proposed method overcomes the main limitations of existing ZSL models, including loss of temporal information, severe domain shift and loss of inter-class information, and further improves the recognition accuracy of unseen videos.

In summary, we develop a Human Action Recognition System consisting of two main modules, multi-shot action recognition and zero-shot action recognition, which gives readers a more intuitive understanding of this dissertation.
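As a simplified illustration of the zero-shot module described above, the sketch below learns a linear visual-to-semantic mapping by ridge regression on seen classes and labels unseen videos by their nearest semantic prototype. This is only a common ZSL baseline, not the proposed methods: the function names are assumptions, and the max-margin structural SVM, temporal factors and aligned regularization are not modelled here.

import numpy as np

def learn_visual_to_semantic(X_seen, S_seen, lam=1.0):
    """Ridge regression from video features to semantic vectors.
    X_seen: (n, d) features of seen-class videos;
    S_seen: (n, q) semantic vectors (e.g. attributes or word embeddings).
    Returns the mapping W of shape (d, q)."""
    d = X_seen.shape[1]
    return np.linalg.solve(X_seen.T @ X_seen + lam * np.eye(d), X_seen.T @ S_seen)

def predict_unseen(X_test, W, prototypes, class_ids):
    """Project unseen videos into the semantic space and label each one with the
    unseen class whose prototype has the highest cosine similarity."""
    S_hat = X_test @ W
    S_hat = S_hat / (np.linalg.norm(S_hat, axis=1, keepdims=True) + 1e-12)
    P = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-12)
    return np.asarray(class_ids)[np.argmax(S_hat @ P.T, axis=1)]

The proposed methods differ from this baseline mainly in preserving the temporal dynamics of video sequences and in aligning target-domain embeddings with class prototypes to mitigate domain shift.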
Keywords/Search Tags: Human action recognition, spatio-temporal features, bag of words model (BoW), sparse coding, zero-shot learning, structural SVM, domain shift problem