Human Action Recognition Based On Zero-Shot Learning

Posted on: 2020-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: R Q An
GTID: 2428330575498563
Subject: Signal and Information Processing
Abstract/Summary:
Human action recognition is a central topic in computer vision because of its theoretical value and its application prospects. With the development of deep learning, encouraging breakthroughs have been achieved in human action recognition based on multi-shot (supervised) learning. However, these methods require a large amount of annotated training data and cannot be extended to recognition tasks with few or no training samples, which limits the generalization ability of the recognition model. Zero-shot learning transfers knowledge from the training data to the prediction of unseen categories at test time, so it offers a way to address these problems.

At present, most existing zero-shot action recognition methods focus on still images. Applying such methods to zero-shot action recognition on video sequences loses temporal information and fails to learn the relationship between the visual features and the class semantics of complex actions effectively. To solve these problems, this thesis focuses on building a more effective visual-to-semantic mapping by using visual features with temporal characteristics and semantic representations with semantic correlation. Furthermore, the single-label task is extended to the recognition of multi-label action data. The main contributions are summarized as follows:

(1) A zero-shot action recognition method based on temporal modeling and a spatiotemporal network is proposed. The method uses a two-stream spatiotemporal network in which RGB information is processed by the spatial stream and optical flow information by the temporal stream. Features are extracted by a convolutional neural network and then fed into a recurrent neural network, which models the sequence context and captures the temporal dynamics of the video sequence. Finally, the spatiotemporal features with high-level semantics are fused to enhance the representational ability of the visual embedding and thereby achieve better recognition performance.

(2) A zero-shot action recognition method based on a joint space and a spatiotemporal network is proposed. The joint (common) space bridges the gap between the visual space and the semantic space: the visual features of the video data and the semantic representations of the categories are both mapped into this space to learn the correspondence between them. This mapping not only models the relationship in each dimension of the visual features and the semantic representations, but also simultaneously optimizes the correlation among visual features, semantic representations, and action categories. The result is a more effective mapping function between visual features and class semantics and improved recognition performance on unseen classes.

(3) A zero-shot action recognition method based on a joint space and multi-label learning is proposed. Given the complexity of multi-label learning, this thesis uses joint latent embedding learning to learn a joint latent space for visual features and semantic representations. In the learned space, the visual features and semantic representations of human actions are mapped to a visual embedding and a semantic embedding, respectively. The visual model and the semantic model are trained alternately, and a multi-loss optimization is designed for training in order to accomplish the multi-label zero-shot action recognition task.
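The joint-space idea behind contributions (2) and (3) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the linear projections `W_visual` and `W_semantic`, the cosine-similarity scoring, the class names, the toy vectors, and the 0.5 threshold are all assumptions chosen for illustration. The sketch only shows how a video feature and the semantic vectors of unseen classes might be compared once both are mapped into a common space, with an argmax for the single-label case and a threshold for the multi-label case.

```python
import math

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_classes(visual_feature, W_visual, class_semantics, W_semantic):
    """Project both modalities into the joint space and score each class."""
    z_visual = matvec(W_visual, visual_feature)
    return {
        name: cosine(z_visual, matvec(W_semantic, semantic))
        for name, semantic in class_semantics.items()
    }

def predict_single(scores):
    """Single-label prediction: the best-matching unseen class."""
    return max(scores, key=scores.get)

def predict_multi(scores, threshold=0.5):
    """Multi-label prediction: every class scoring at or above the threshold."""
    return sorted(name for name, s in scores.items() if s >= threshold)

# Toy data (hypothetical): 3-d visual feature, 2-d semantic vectors, 2-d joint space.
W_visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # assumed learned projection
W_semantic = [[1.0, 0.0], [0.0, 1.0]]           # assumed learned projection
unseen_classes = {"surfing": [1.0, 0.0], "archery": [0.0, 1.0]}
video_feature = [0.9, 0.1, 0.0]

scores = score_classes(video_feature, W_visual, unseen_classes, W_semantic)
single = predict_single(scores)
multi = predict_multi(scores)
```

In a trained system the two projections would be learned from seen classes only, and the class semantics would come from attributes or word embeddings; here they are fixed by hand so the scoring step can be read in isolation.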
Keywords/Search Tags: Zero-Shot Learning, Human Action Recognition, Temporal Modeling, Spatiotemporal Network, Semantic Embedding Space, Joint Space, Multi-label Learning