
Analyzing And Understanding Human Actions In Videos

Posted on: 2016-11-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C W Liu
Full Text: PDF
GTID: 1108330503953426
Subject: Computer Science and Technology
Abstract/Summary:
Automatically analyzing and understanding human actions in videos is a highly active topic in computer vision and pattern recognition, with wide applications in intelligent visual surveillance, human-computer interaction, video retrieval, and video summarization. This thesis focuses on action analysis and understanding, addressing the extraction of mid-level features, joint segmentation and recognition of multiple actions in long videos, semantic analysis of complex activities, and joint recognition and localization of actions.

First, a novel random forest learning framework is proposed to construct a discriminative and informative mid-level feature from multiple low-level features, under the guidance of high-level semantic concepts. Densely sampled 3D cuboids are characterized by multiple complementary low-level features and classified by their corresponding random forests with a novel fusion scheme; the posterior probabilities of all cuboids are then concatenated to form the mid-level feature. Moreover, the temporal context between local cuboids is exploited as an additional type of low-level feature (a simplified sketch of this construction appears below). Experiments on the Weizmann, UCF Sports, Ballet, and multi-view IXMAS datasets demonstrate that the proposed framework effectively fuses multiple low-level features into a discriminative mid-level feature.

Second, the thesis investigates the challenging problem of jointly segmenting and recognizing actions in a long video, and proposes a novel latent discriminative structural model that performs temporal segmentation and action recognition simultaneously. Latent variables are introduced to discover semantically meaningful and discriminative concepts shared among different actions. The model describes the interaction among video features, latent concepts, and actions, and captures the temporal context of video segments at both the action level and the concept level. For an input video containing multiple actions, a dynamic programming algorithm finds the optimal segmentation while simultaneously recognizing the action in each segment (a minimal dynamic-programming sketch appears below). Experiments show that the proposed method effectively divides a video into segments and labels each segment with an action.

Third, to analyze and understand complex activities, the thesis proposes a hierarchical description of an action video that captures which complex activity occurs, what atomic actions compose it, and when each atomic action happens. Each complex activity is characterized as a composition of atomic actions carrying simple semantic information. A latent discriminative structural model is developed to detect the complex activity and its atomic actions automatically, while simultaneously analyzing the temporal structure of the atomic actions. A segment-annotation mapping is introduced to relate video segments to their associated atomic actions, allowing different video segments to explain different atomic actions; this mapping is treated as latent information in the model. Moreover, a semi-supervised learning method is presented to automatically predict the atomic-action labels of unlabeled training videos when labeled training data is limited, which greatly alleviates the laborious and time-consuming annotation of atomic actions in training data. Experiments on three datasets demonstrate the effectiveness of the proposed method.
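The mid-level feature construction of the first contribution can be illustrated with a minimal sketch, assuming scikit-learn and toy HOG3D/HOF-style descriptors. All names and dimensions here are placeholders, and simple posterior averaging stands in for the thesis's actual fusion scheme:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

N_CUBOIDS, N_CLASSES = 50, 4
rng = np.random.default_rng(0)

# Toy low-level descriptors for each densely sampled 3D cuboid.
features = {
    "hog3d": rng.normal(size=(N_CUBOIDS, 96)),
    "hof": rng.normal(size=(N_CUBOIDS, 72)),
}
concepts = rng.integers(0, N_CLASSES, size=N_CUBOIDS)  # high-level concept labels

# One random forest per low-level feature type.
forests = {
    name: RandomForestClassifier(n_estimators=50, random_state=0).fit(X, concepts)
    for name, X in features.items()
}

def midlevel_feature(feats):
    """Fuse per-feature posteriors for each cuboid (here: simple averaging,
    a stand-in for the thesis's fusion scheme), then concatenate the
    posteriors of all cuboids into a single video-level vector."""
    posteriors = np.mean(
        [forests[name].predict_proba(X) for name, X in feats.items()], axis=0
    )
    return posteriors.reshape(-1)  # shape: (N_CUBOIDS * N_CLASSES,)

print(midlevel_feature(features).shape)  # -> (200,)
```

The concatenated posteriors are what makes the representation "mid-level": each entry is a semantic concept score rather than a raw descriptor value.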
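For the second contribution, the joint segmentation and recognition step can be sketched as a generic segmental dynamic program. The latent concepts and the structural model are abstracted into an assumed black-box scorer `score(s, t, a)`, and `max_len` is an illustrative cap on segment length; none of these names come from the thesis:

```python
import numpy as np

def segment_and_recognize(T, n_actions, score, max_len=30):
    """Segment frames [0, T) into labeled segments maximizing the total
    segment score. score(s, t, a) rates frames [s, t) as action a."""
    best = np.full(T + 1, -np.inf)
    best[0] = 0.0
    back = [None] * (T + 1)  # backpointers: (segment start, action)
    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            for a in range(n_actions):
                val = best[s] + score(s, t, a)
                if val > best[t]:
                    best[t], back[t] = val, (s, a)
    segments, t = [], T
    while t > 0:  # walk backpointers from the end of the video
        s, a = back[t]
        segments.append((s, t, a))
        t = s
    return segments[::-1]  # list of (start, end, action) triples

# Toy scorer: prefers ~10-frame segments and action label 1.
toy_score = lambda s, t, a: -abs((t - s) - 10) + (1.0 if a == 1 else 0.0)
print(segment_and_recognize(25, 3, toy_score))
```

The dynamic program returns the optimal segmentation and its action labels in a single pass, which is the "joint" aspect the thesis emphasizes.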
Finally, the thesis proposes an action recognition and localization method based on transfer learning. A novel Transfer Latent Support Vector Machine (TLSVM) is developed for joint recognition and localization of actions, using Web images together with training videos that are annotated only with action labels. Action locations in videos are modeled as latent variables, and an unsupervised method generates a set of spatiotemporal region candidates; TLSVM recognizes an action while simultaneously selecting a region candidate as the localization result (a small sketch of this latent inference appears below). To improve localization accuracy with prior information about action locations, a number of Web images annotated with both action labels and action locations are introduced, and a discriminative model is learned by enforcing local similarities between videos and Web images. A structural transformation based on randomized clustering forests maps the Web images to the videos, handling the heterogeneous features of the two domains. Experiments on two public action datasets demonstrate the effectiveness of the proposed model for both action localization and action recognition.
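The latent inference at the heart of TLSVM can be illustrated as follows. Random weights and features stand in for the learned model and the unsupervised region candidates; `recognize_and_localize` and all dimensions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
N_CLASSES, N_CANDIDATES, DIM = 5, 20, 64

w = rng.normal(size=(N_CLASSES, DIM))              # per-class weights (toy)
candidates = rng.normal(size=(N_CANDIDATES, DIM))  # region candidate features

def recognize_and_localize(w, candidates):
    """Score every (action, region) pair; the best pair yields both the
    predicted action and the selected region (the latent variable)."""
    scores = w @ candidates.T                      # (N_CLASSES, N_CANDIDATES)
    action, region = np.unravel_index(scores.argmax(), scores.shape)
    return int(action), int(region), float(scores[action, region])

print(recognize_and_localize(w, candidates))
```

Because the argmax region doubles as the localization output, recognition and localization fall out of the same maximization rather than being solved in separate stages.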
Keywords/Search Tags: video understanding, action representation, action recognition, mid-level feature, atomic action, discriminative structural model