
Research On Cross-domain Human Action Recognition Via Transfer Learning

Posted on: 2020-08-12    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y. Liu    Full Text: PDF
GTID: 1368330602450275    Subject: Communication and Information System
Abstract/Summary:
Human action recognition is an important subject in computer vision research and applications, with broad prospects in medical monitoring systems, intelligent home systems, virtual reality, human-computer interaction, intelligent security, content-based video retrieval, athlete-training assistance systems, and more. Traditional action recognition frameworks usually rely on two conditions: 1) the number of training samples is sufficient to learn a good classification model, and 2) testing samples and training samples follow the same distribution. However, these two conditions cannot always be satisfied in real-world scenarios. On the one hand, with the rapid development of the Internet and the rise of deep learning in the big-data era, the amount of video data grows rapidly every day; manually labeling such volumes is impractical and would cost enormous human, material, and financial resources. On the other hand, novel video modalities keep emerging: videos collected in different environments (background complexity, illumination, scenes, etc.), videos collected from different camera viewpoints (front view, side view, top view, etc.), videos collected by different sensors (visible-light video, thermal-infrared video, depth-camera video, etc.), and action data of different media types (images, videos, motion data collected by various sensors, etc.). The difficulty of collecting training samples differs greatly across these modalities, which leads to a shortage of training samples in some modalities and large distribution differences among modalities. If a traditional classification method is applied to such cross-domain action recognition, that is, a limited number of training samples in one modality are used to learn a classifier that is then applied directly to another modality, the results are neither accurate nor reliable, and classification performance drops significantly.

This dissertation designs transfer learning methods tailored to cross-domain human action recognition. They effectively reduce the distribution differences between data domains, cut the training time and effort needed to initialize a new action recognition system, make the system more general and more robust, and exploit knowledge from existing data domains. The dissertation focuses on three types of cross-domain action recognition problems: 1) cross-spectral, 2) cross-view, and 3) cross-media action recognition. The major work and contributions are outlined as follows:

1. To address the scarcity of infrared action data for infrared human action recognition, a transferable representation learning algorithm based on feature alignment and generalization is proposed. It uses visible-light video as auxiliary data to enhance infrared action recognition: an infrared action dataset serves as the target domain, and the self-built visible-light action dataset XD145 serves as the source domain. Source- and target-domain representations are mapped into the same latent feature space by kernel manifold alignment, yielding aligned feature representations. A pair of aligned-to-generalized encoders is then designed for feature generalization, and the generalized features from both domains are used to train the classifier. Experimental results show that the proposed method achieves state-of-the-art performance compared with several transfer learning and domain adaptation methods on the publicly available infrared action recognition dataset InfAR.
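The abstract does not reproduce the alignment equations, so the following is a minimal numpy sketch of a kernel manifold alignment step in the spirit of contribution 1: both domains are embedded through a joint kernel and a graph Laplacian that rewards cross-domain correspondences. The RBF kernel choice, the correspondence matrix C, the trade-off mu, and the latent dimension d are all illustrative assumptions, not the dissertation's exact formulation.

```python
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel between the rows of X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_manifold_align(Xs, Xt, C, d=32, mu=0.5, gamma=1.0):
    # Xs: (ns, f) source features; Xt: (nt, f) target features.
    # C: (ns, nt) binary correspondence matrix, C[i, j] = 1 when
    # source sample i and target sample j show the same action.
    ns, nt = len(Xs), len(Xt)
    # Joint kernel over both domains (block-diagonal structure).
    K = np.zeros((ns + nt, ns + nt))
    K[:ns, :ns] = rbf_kernel(Xs, Xs, gamma)
    K[ns:, ns:] = rbf_kernel(Xt, Xt, gamma)
    # Joint similarity graph: within-domain kernel affinities plus
    # cross-domain correspondences weighted by mu.
    W = K.copy()
    W[:ns, ns:] = mu * C
    W[ns:, :ns] = mu * C.T
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    # Smallest generalized eigenvectors of (K L K) v = lam (K D K) v
    # give embeddings that keep corresponding samples close.
    A = K @ L @ K
    B = K @ D @ K + 1e-6 * np.eye(ns + nt)      # ridge for stability
    _, vecs = eigh(A, B)
    Z = K @ vecs[:, :d]                         # shared latent coordinates
    return Z[:ns], Z[ns:]                       # aligned source / target
```

In such a latent space, corresponding source and target samples sit close together, which is the property a subsequent generalization stage (here, the aligned-to-generalized encoders) would build on.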
2. Existing infrared action recognition methods rely on spatial or local temporal information and neglect global temporal information. To address this, the CNN structure of mainstream visible-light action recognition methods is transferred to the infrared action recognition framework, and a novel global temporal representation, the optical-flow stacked difference image, is proposed; a three-stream convolutional neural network is then constructed to extract robust features from it. The inputs of the network are the optical-flow image, the optical-flow motion-history image, and the optical-flow stacked difference image, which capture local temporal, spatial-temporal, and global temporal information, respectively. A trajectory-constrained pooling strategy then extracts features from the convolutional layers of the three streams, producing a novel representation named three-stream trajectory-pooled deep convolutional descriptors. Experimental results show that the proposed optical-flow stacked difference image better describes the global temporal information of infrared human actions, that it is complementary to local temporal information (optical-flow image) and spatial-temporal information (optical-flow motion-history image), and that the extracted features significantly improve infrared action recognition performance.

3. To address the large appearance variations across camera views in cross-view action recognition, a hierarchically learned view-invariant representation is proposed. First, a sample-affinity matrix is incorporated into a marginalized stacked denoising autoencoder to obtain shared features, which are combined with private features to form robust features. To make video representations transferable across views, a transferable dictionary pair is then learned simultaneously from pairs of videos taken at different views, encouraging each action video to have the same sparse representation across views. However, a distribution difference across views may remain, because a single unified subspace in which the sparse representations of one action agree across views may not exist when the view difference is large. A novel unsupervised distribution adaptation method is therefore proposed that learns a set of projections mapping source- and target-view data into their respective subspaces while encouraging the difference between these subspaces to be as small as possible (a minimal sketch of this step follows contribution 4 below). The features projected into these subspaces are the final view-invariant representations. Experimental results show that the hierarchically learned representation is view-invariant and robust even to large view differences, and that the proposed method outperforms most state-of-the-art approaches.
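The abstract names the optical-flow stacked difference image but not its construction, so the sketch below shows one plausible reading: the clip's flow fields are averaged within a few temporal stages, and the absolute differences between consecutive stage averages are stacked channel-wise, summarizing motion change over the whole clip in a single CNN input. The function name, num_stages, and the staging scheme are hypothetical, not the dissertation's definition.

```python
import numpy as np

def stacked_flow_difference(flow, num_stages=4):
    # flow: (T, H, W, 2) horizontal/vertical optical-flow fields
    # for one clip; assumes T >= num_stages.
    T, H, W, _ = flow.shape
    bounds = np.linspace(0, T, num_stages + 1, dtype=int)
    # Average the flow within each temporal stage of the clip.
    stage_means = np.stack(
        [flow[a:b].mean(axis=0) for a, b in zip(bounds[:-1], bounds[1:])]
    )                                              # (num_stages, H, W, 2)
    # Absolute differences between consecutive stage averages
    # capture how motion changes over the *whole* clip.
    diffs = np.abs(np.diff(stage_means, axis=0))   # (num_stages-1, H, W, 2)
    # Stack the differences along the channel axis, yielding one
    # image with 2 * (num_stages - 1) channels as a CNN input.
    return diffs.transpose(1, 2, 0, 3).reshape(H, W, -1)
```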
4. To address the fact that collecting and labeling videos is far harder than collecting and labeling images, a deep image-to-video adaptation and fusion network is proposed that exploits the complementarity between images and videos. It is a unified deep learning model that integrates domain-invariant representation learning and cross-modal feature fusion into a single optimization framework, enhancing action recognition in videos by transferring knowledge from images with video keyframes as a bridge. First, a novel cross-modal similarity metric is designed to reduce the modality shift among images, keyframes, and videos. Then, the learned domain-invariant keyframe features, video features, and their concatenations are projected into the same semantic space by three newly designed autoencoders, with the constraint that the hidden-layer representations equal the semantic representations of the action class names, yielding more compact, informative, and discriminative representations. Finally, the concatenation of the semantic feature representations learned by the three autoencoders is used to train the classifier for action recognition in videos. Experimental results show that the proposed algorithm effectively enhances video action recognition with knowledge from images, even when training video samples are limited.
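As referenced in contribution 3 above, the projection-learning step (per-view subspaces whose difference is driven small) is close in spirit to classic subspace alignment (Fernando et al., ICCV 2013), which is used here as a hedged stand-in; the PCA-based bases and the dimension d are assumptions rather than the dissertation's exact method.

```python
import numpy as np

def pca_basis(X, d):
    # Top-d principal directions (as columns) of mean-centered X.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T

def subspace_align(Xs, Xt, d=32):
    # Each view gets its own PCA subspace; the source basis is
    # rotated by M = Ps^T Pt, the Frobenius-optimal map that makes
    # the two subspaces coincide as closely as possible.
    Ps, Pt = pca_basis(Xs, d), pca_basis(Xt, d)
    M = Ps.T @ Pt
    Zs = (Xs - Xs.mean(axis=0)) @ Ps @ M   # aligned source coordinates
    Zt = (Xt - Xt.mean(axis=0)) @ Pt       # target coordinates
    return Zs, Zt
```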
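For contribution 4's semantic-space projection, here is a minimal PyTorch sketch of one of the three autoencoders, with the hidden code tied to the semantic embedding of the action class name; feat_dim, sem_dim, and the loss weight lam are illustrative assumptions, not the dissertation's settings.

```python
import torch
import torch.nn as nn

class SemanticAutoencoder(nn.Module):
    # One-layer encoder/decoder whose hidden code lives in the
    # semantic space of class-name embeddings (e.g. word vectors).
    def __init__(self, feat_dim=4096, sem_dim=300):
        super().__init__()
        self.enc = nn.Linear(feat_dim, sem_dim)
        self.dec = nn.Linear(sem_dim, feat_dim)

    def forward(self, x):
        z = self.enc(x)            # hidden semantic code
        return z, self.dec(z)      # code and reconstruction

def semantic_ae_loss(model, x, class_embedding, lam=1.0):
    # Reconstruction term plus the constraint that the hidden code
    # matches the class-name embedding; class_embedding should have
    # shape (batch, sem_dim), one row per sample's class.
    z, x_hat = model(x)
    mse = nn.functional.mse_loss
    return mse(x_hat, x) + lam * mse(z, class_embedding)
```

A call such as semantic_ae_loss(model, video_feats, word_vecs[labels]) would then combine reconstruction with the semantic constraint, where word_vecs is assumed to hold one sem_dim-dimensional embedding per action class name.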
Keywords/Search Tags:Human Action Recognition, Transfer Learning, Domain Adaptation, Cross Spectral, Cross View, Cross Media