
Human Action Recognition In Videos Of Realistic Scenes Based On Multi-Scale CNN Feature

Posted on: 2019-01-18    Degree: Master    Type: Thesis
Country: China    Candidate: Y S Zhou    Full Text: PDF
GTID: 2428330566979996    Subject: Computer software and theory
Abstract/Summary:
Internet technology has developed rapidly in recent years. The accompanying growth of image, voice, text, and other data has driven advances in computer vision and artificial intelligence. Automatic human action recognition in realistic scenes has gained increasing popularity because of its importance in applications such as video indexing and retrieval, intelligent video surveillance, and human-computer interaction. The core problem of action recognition is to recognize and analyze human actions in videos. However, recognizing actions in realistic scenes is challenging because of camera movement, scale changes, cluttered backgrounds, viewpoint variations, partial occlusion, and shading.

A typical action-recognition pipeline has three main steps: first, extract visual features; second, encode the video sequence into a feature representation; third, train a classifier on labeled training data and use it for classification and recognition. A robust and discriminative feature representation of video sequences is therefore crucial to recognition performance.

In this dissertation, we develop a novel method for building a robust feature representation based on deep convolutional features and the Latent Dirichlet Allocation (LDA) topic model for human action recognition. Compared with traditional CNN features drawn from the fully connected layers, we show that a low-dimensional representation built on the deep convolutional layers is more discriminative. In addition, on top of the convolutional feature maps, we apply a multi-scale pooling strategy to better handle objects at different scales and with different deformations. Moreover, we adopt LDA to explore the semantic relationships in video sequences and generate a topic histogram to represent each video, since LDA emphasizes content coherence rather than mere spatial contiguity.

Furthermore, to capture the temporal correlations of an action sequence, we introduce the Long Short-Term Memory (LSTM) structure and construct an end-to-end deep network model based on a multi-scale convolutional neural network and an LSTM network. The proposed approach builds a multi-scale feature representation in the convolutional layers by automatically learning fusion weights. The multi-scale feature of each video frame is fed into the LSTM network to capture the dynamic information across frames. Finally, the output of each LSTM unit is fed into a classification layer that outputs the action label.

To verify the effectiveness of the proposed algorithms, we carried out experiments on two challenging datasets, UCF-Sports and UCF-11, directly using the convolutional feature maps of a VGG-16 model pre-trained on ImageNet. Extensive experimental results on both datasets show that the two proposed algorithms improve human action recognition.
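The multi-scale pooling step described above can be sketched as follows. This is a minimal NumPy illustration, not the dissertation's actual implementation: the function name `multi_scale_pool`, the grid scales, and the use of max-pooling per cell are assumptions chosen to show the general idea of pooling a convolutional feature map over grids of several scales and concatenating the results into one fixed-length descriptor.

```python
import numpy as np

def multi_scale_pool(fmap, scales=(1, 2, 4)):
    """Max-pool a conv feature map over grids of several scales and
    concatenate the results into one fixed-length descriptor.

    fmap: (C, H, W) array of convolutional activations.
    Returns a vector of length C * sum(s * s for s in scales).
    """
    C, H, W = fmap.shape
    parts = []
    for s in scales:
        # split the spatial plane into an s x s grid of cells
        hs = np.linspace(0, H, s + 1, dtype=int)
        ws = np.linspace(0, W, s + 1, dtype=int)
        for i in range(s):
            for j in range(s):
                cell = fmap[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                parts.append(cell.max(axis=(1, 2)))  # per-channel max
    return np.concatenate(parts)

rng = np.random.default_rng(0)
fmap = rng.random((512, 14, 14))   # e.g. the shape of a VGG-16 conv5_3 map
desc = multi_scale_pool(fmap)
print(desc.shape)                  # 512 * (1 + 4 + 16) = 10752 dimensions
```

Because each grid cell is reduced to a per-channel maximum, the descriptor length depends only on the channel count and the chosen scales, not on the spatial size of the input map, which is what lets the representation handle objects at different scales.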
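The LDA step can likewise be sketched. The toy example below assumes each video has already been quantized into counts over a vocabulary of visual words (a preprocessing step not shown here); the function `lda_gibbs`, the hyperparameters, and the random counts are all illustrative stand-ins, not the dissertation's settings. It runs collapsed Gibbs sampling and returns the per-video topic histograms used as the final representation.

```python
import numpy as np

def lda_gibbs(counts, K=5, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA on a document-word count matrix.

    counts: (D, V) int array; counts[d, v] = occurrences of visual word v
            in video d.  Returns a (D, K) matrix of topic histograms.
    """
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    # expand each count row into an explicit list of word tokens
    docs = [np.repeat(np.arange(V), counts[d]) for d in range(D)]
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # topic assignments
    ndk = np.zeros((D, K)); nkv = np.zeros((K, V)); nk = np.zeros(K)
    for d, (doc, zd) in enumerate(zip(docs, z)):
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkv[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, (doc, zd) in enumerate(zip(docs, z)):
            for n, w in enumerate(doc):
                t = zd[n]
                # remove the token, resample its topic, then add it back
                ndk[d, t] -= 1; nkv[t, w] -= 1; nk[t] -= 1
                p = (ndk[d] + alpha) * (nkv[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())
                zd[n] = t
                ndk[d, t] += 1; nkv[t, w] += 1; nk[t] += 1
    # smoothed per-document topic distributions (the topic histograms)
    return (ndk + alpha) / (ndk.sum(1, keepdims=True) + K * alpha)

rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(6, 20))  # 6 toy videos, 20 visual words
theta = lda_gibbs(counts)
print(theta.shape)                          # one K-topic histogram per video
```

Each row of the returned matrix is a probability distribution over topics, so two videos whose visual words co-occur in similar patterns end up with similar histograms even when the words themselves differ, which is the content-coherence property motivating the use of LDA here.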
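The CNN-plus-LSTM pipeline can be sketched in the same spirit: a single LSTM layer consumes one multi-scale CNN feature per frame, and a classification layer scores each step. The weights below are random stand-ins (the real model is trained end to end), and the function name, dimensions, and score-averaging choice are assumptions made for the sake of a self-contained example.

```python
import numpy as np

def lstm_classify(frames, params):
    """Run a single-layer LSTM over per-frame CNN features and average
    the per-step class scores into one softmax distribution.

    frames: (T, F) array, one multi-scale CNN feature per frame.
    params: dict with 'W' (4H, F+H), 'b' (4H,), 'Wc' (n_classes, H).
    """
    W, b, Wc = params['W'], params['b'], params['Wc']
    H = Wc.shape[1]
    h = np.zeros(H); c = np.zeros(H)
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))
    logits = []
    for x in frames:
        gates = W @ np.concatenate([x, h]) + b
        i, f, o, g = np.split(gates, 4)          # input/forget/output/cell gates
        c = sigm(f) * c + sigm(i) * np.tanh(g)   # cell-state update
        h = sigm(o) * np.tanh(c)                 # hidden state
        logits.append(Wc @ h)                    # per-frame class scores
    avg = np.mean(logits, axis=0)
    e = np.exp(avg - avg.max())                  # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
F, H, n_classes, T = 64, 32, 11, 16              # UCF-11 has 11 action classes
params = {'W': rng.normal(0, 0.1, (4 * H, F + H)),
          'b': np.zeros(4 * H),
          'Wc': rng.normal(0, 0.1, (n_classes, H))}
p = lstm_classify(rng.normal(size=(T, F)), params)
print(p.shape)                                   # one probability per class
```

Feeding every LSTM step into the classifier, rather than only the last one, lets each frame contribute to the decision, which mirrors the description above of passing the output of each LSTM unit to the classification layer.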
Keywords/Search Tags:Human Action Recognition, Multi-Scale Pooling, CNN Feature Representation, Latent Dirichlet Allocation (LDA)