Research On Human Action Recognition Based On Deep Learning

Posted on: 2019-03-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Yu
Full Text: PDF
GTID: 1368330545497330
Subject: Computer Science and Technology
Abstract/Summary:
Human action recognition enables a computer to automatically analyze unknown events in video by using machine learning and computer vision methods. Because of cluttered backgrounds, varying lighting conditions, different action rates, and partial occlusion, action patterns show large inter-class and intra-class variation. Human action recognition is a classic, challenging task in target recognition and an active research topic in computer vision. It has many useful applications, such as intelligent video surveillance, video retrieval, and human-computer interfaces, so it has both theoretical value and broad application prospects.

Based on a survey of the state of the art, action recognition methods for video fall into two types: methods based on hand-crafted features and deep learning methods. Among the deep learning methods, CNNs usually contain millions of parameters and are prone to overfitting when trained on small datasets. It is also difficult to capture long-term temporal information in video. We propose solutions to these problems. The major works and contributions are summarized as follows.

1. Recently, convolutional neural networks (CNNs) have achieved impressive results on many image recognition tasks. However, CNNs usually contain millions of parameters and are prone to overfitting when trained on small datasets, so they do not outperform traditional methods on action recognition. In this study, we design a novel two-stream fully convolutional network (FCN) architecture for action recognition that significantly reduces the number of parameters while maintaining performance. A max pooling layer with a larger stride computes a compact frame-level feature that exploits both spatial and temporal information. Our FCNs fuse the pixel-wise corresponding appearance and motion features with a linear weighted fusion method, which significantly improves accuracy. To capture the temporal information of human actions, a video pooling method is adopted to construct video-level features. We study several video pooling methods, namely Fisher Vector (FV), Vector of Locally Aggregated Descriptors (VLAD), and Temporal Pyramid Pooling (TPP), and find that TPP is the most suitable for constructing video-level features.

2. Action recognition in video is one of the most important and challenging tasks in computer vision, and spatio-temporal information plays a crucial role in representing video for this task. We design a recurrent hybrid network architecture for action recognition that fuses multi-source features: two-stream CNNs for learning semantic features, a two-stream single-layer LSTM for learning long-term temporal features, and an improved Dense Trajectories (iDT) stream for learning short-term temporal motion features. To mitigate overfitting on small-scale datasets, a video data augmentation method is used to increase the amount of training data, and a two-step training strategy is adopted to train the recurrent hybrid network. Experimental results on two challenging datasets, UCF101 and HMDB51, demonstrate that the proposed method reaches state-of-the-art performance.

3. Recurrent neural networks can effectively process long-term sequence information, especially in text and speech processing. However, recurrent neural network-based action recognition methods are prone to overfitting on existing action recognition datasets, and their recognition accuracy is lower than that of hand-crafted feature methods. Shallow recurrent neural networks also have difficulty learning rich semantic features. To improve the ability of recurrent neural networks to learn rich semantic features, we design a pseudo residual recurrent neural network. First, residual learning is introduced into the recurrent neural network, and the number of layers is increased to a medium scale (3 to 4 layers). Compared with a shallow recurrent neural network, the pseudo residual recurrent neural network can learn richer action semantic features. Second, we find that the connection method used in the original deep residual network is not applicable to recurrent neural networks. Instead, we connect the input features of the network to each hidden layer and find that this type of residual connection is better suited to learning action recognition features. Finally, fusing the pseudo residual recurrent neural network with iDT features improves the accuracy of the model. The results show that neural network features and hand-crafted features are complementary for human action recognition.

In summary, this work starts from the problems of too many parameters and too few training samples in deep learning models, and proposes a human action recognition method based on a fully convolutional network. A hybrid recurrent neural network is then designed to learn long-term motion features. Finally, because the semantic features learned by shallow recurrent neural networks are not rich enough, a human action recognition method based on a pseudo residual recurrent neural network is proposed. Experiments were carried out on the UCF101 and HMDB51 datasets. The proposed methods can effectively learn semantically rich long-term motion features from video and improve action recognition performance.
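
As a rough illustration of the temporal pyramid pooling (TPP) step named in the first contribution, the sketch below turns a sequence of frame-level feature vectors into one fixed-length video-level vector. It is a minimal sketch under stated assumptions and not the dissertation's implementation: the function name `temporal_pyramid_pooling`, the pyramid levels `(1, 2, 4)`, and the use of max pooling inside each temporal segment are all illustrative choices that the abstract does not specify.

```python
from typing import List, Sequence


def temporal_pyramid_pooling(
    frames: Sequence[Sequence[float]],
    levels: Sequence[int] = (1, 2, 4),  # assumed pyramid levels, not from the source
) -> List[float]:
    """Pool frame-level features into one video-level feature vector.

    For each pyramid level with n segments, the timeline is split into n
    contiguous segments, each segment is max-pooled per feature dimension
    (an assumed pooling operator), and all pooled vectors are concatenated.
    """
    if not frames:
        raise ValueError("at least one frame-level feature vector is required")

    num_frames = len(frames)
    video_feature: List[float] = []
    for num_segments in levels:
        for i in range(num_segments):
            # Floor for the start and ceiling for the end keep every segment
            # non-empty, even when there are fewer frames than segments.
            start = (i * num_frames) // num_segments
            end = ((i + 1) * num_frames + num_segments - 1) // num_segments
            segment = frames[start:end]
            # zip(*segment) iterates over feature dimensions across the frames.
            video_feature.extend(max(column) for column in zip(*segment))
    return video_feature


# Four frames with two-dimensional features, pooled at levels 1 and 2.
frame_features = [[1.0, 0.0], [2.0, 5.0], [0.0, 3.0], [4.0, 1.0]]
pooled = temporal_pyramid_pooling(frame_features, levels=(1, 2))
print(pooled)  # [4.0, 5.0, 2.0, 5.0, 4.0, 3.0]
```

The output length is the feature dimension times the sum of the levels, so videos with different numbers of frames map to vectors of the same size, which is what lets a classifier operate on video-level features.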
Keywords/Search Tags: Action Recognition, Deep Learning, Convolutional Neural Network, Recurrent Neural Network, Residual Learning, Pseudo Residual Recurrent Neural Network