
Human Action Recognition Based On Spatial-temporal DenseNet

Posted on: 2019-10-28 | Degree: Master | Type: Thesis
Country: China | Candidate: J Q Zhong | Full Text: PDF
GTID: 2428330566487224 | Subject: Engineering
Abstract/Summary:
The rapid development of computer vision has made it possible for computers to understand video content. As one of the most important parts of video understanding, human action recognition has become an active and challenging research topic. It has broad and promising applications and plays an important role in many domains, such as driver assistance, motion analysis in sports, and intelligent video surveillance. However, several challenging aspects of human action recognition have not yet been well addressed. For example, appearance and motion variations among individuals increase intra-class variance and reduce inter-class variance, which can degrade recognition accuracy. Variations in illumination and viewing angle also have a negative impact on the final recognition results. The main focus of this thesis is to summarize existing work on human action recognition, analyze its shortcomings, and propose a novel human action recognition algorithm based on a spatial-temporal DenseNet that addresses these problems. The main contributions of this thesis are as follows:

First, we propose a 3D DenseNet model. Since videos are composed of image sequences, temporal information is lost if features are extracted from single frames. To fully exploit temporal information, we extend the conventional DenseNet model from 2D to 3D, which enables the network to extract features directly from image sequences. The introduced 3D convolution and 3D pooling operations effectively improve the accuracy of human action recognition in video.

Second, we propose a novel human action recognition method based on a spatial-temporal DenseNet. We construct a spatial-temporal model on top of the 3D DenseNet with two information streams: a spatial stream and a temporal stream. The spatial stream takes fixed-length image sequences as input, while the temporal stream takes fixed-length dynamic-image sequences as input. The predictions of the two streams are fused in the final classification layer to produce the recognition result.

Third, since the spatial and temporal information of an image sequence is separate but interrelated, we merge spatial and temporal information within the proposed DenseNet model. To better extract spatial-temporal features, we introduce different fusion strategies and methods, and we conduct thorough experiments to explore their effect on the proposed DenseNet for human action recognition.

Finally, we evaluate the proposed algorithm on two publicly available human action recognition datasets, UCF101 and HMDB51. The experimental results show that the proposed algorithm achieves high recognition accuracies on the two datasets (93.1% and 68.7%, respectively), significantly better than those of several common and strong methods, including a 2.3% improvement in accuracy on the HMDB51 dataset. In addition, the proposed model has at least 10 times fewer parameters than other networks used for human action recognition, which effectively reduces model complexity and speeds up training and testing.
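To make the 2D-to-3D extension concrete, the following is a minimal sketch (in PyTorch) of a 3D dense block: each layer applies BN-ReLU-3D convolution and concatenates its output with all preceding feature maps along the channel axis, so both spatial and temporal structure are preserved. The growth rate, layer count, and clip size below are illustrative assumptions, not the thesis's actual configuration.

# Hypothetical sketch of a 3D dense block, assuming PyTorch and
# illustrative hyperparameters (not the thesis's exact settings).
import torch
import torch.nn as nn

class DenseLayer3D(nn.Module):
    """BN -> ReLU -> 3x3x3 Conv, producing `growth_rate` new channels."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm3d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv3d(in_channels, growth_rate,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        new_features = self.conv(self.relu(self.bn(x)))
        # Dense connectivity: concatenate input and new features on the channel axis.
        return torch.cat([x, new_features], dim=1)

class DenseBlock3D(nn.Module):
    """Stack of 3D dense layers; channel count grows by growth_rate per layer."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        layers = []
        channels = in_channels
        for _ in range(num_layers):
            layers.append(DenseLayer3D(channels, growth_rate))
            channels += growth_rate
        self.block = nn.Sequential(*layers)
        self.out_channels = channels

    def forward(self, x):
        return self.block(x)

if __name__ == "__main__":
    # Input layout: batch x channels x frames x height x width (e.g. a 16-frame RGB clip).
    clip = torch.randn(2, 3, 16, 112, 112)
    block = DenseBlock3D(in_channels=3, growth_rate=12, num_layers=4)
    print(block(clip).shape)  # -> torch.Size([2, 51, 16, 112, 112])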
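The two-stream arrangement with fusion at the classification layer can be sketched in the same spirit. The two backbones below stand in for 3D-DenseNet-style feature extractors, and averaging the per-stream class probabilities is one common late-fusion choice, not necessarily the exact fusion method adopted in the thesis.

# Hypothetical sketch of spatial/temporal streams with late fusion;
# the backbones and the averaging rule are assumptions for illustration.
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamClassifier(nn.Module):
    def __init__(self, backbone_spatial, backbone_temporal,
                 feat_channels, num_classes=101):
        super().__init__()
        self.spatial = backbone_spatial      # operates on fixed-length RGB clips
        self.temporal = backbone_temporal    # operates on fixed-length dynamic-image clips
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc_spatial = nn.Linear(feat_channels, num_classes)
        self.fc_temporal = nn.Linear(feat_channels, num_classes)

    def forward(self, rgb_clip, dyn_clip):
        s = self.pool(self.spatial(rgb_clip)).flatten(1)
        t = self.pool(self.temporal(dyn_clip)).flatten(1)
        # Late fusion: average the per-stream class probabilities.
        p_spatial = F.softmax(self.fc_spatial(s), dim=1)
        p_temporal = F.softmax(self.fc_temporal(t), dim=1)
        return (p_spatial + p_temporal) / 2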
Keywords/Search Tags: Human action recognition, Two-stream convolutional networks, Spatial-temporal DenseNet