
Research On Action Recognition Based On Spatio-temporal Context Modeling

Posted on: 2018-03-13
Degree: Master
Type: Thesis
Country: China
Candidate: R Ge
Full Text: PDF
GTID: 2348330542965249
Subject: Software engineering
Abstract/Summary:
Video action recognition is one of the most active topics in computer vision, with wide application prospects in intelligent surveillance, virtual reality, medical care, robot vision, and human-computer interaction. To effectively capture the spatio-temporal information of video actions, we propose a spatio-temporal context modeling method. First, we exploit a bi-directional spatio-temporal model to extract bi-directional feature representations. We then exploit an adaptive temporal pyramid to extract multi-granularity feature representations. To compensate for the spatial location information that both of these methods lack, we combine visual features and trajectory features to encode more spatial information. Finally, we fuse the above methods with an adaptive fusion method. The main contributions are summarized as follows:

(1) Most existing methods focus on preceding frames and ignore information from subsequent frames. We propose a robust action feature description based on Bi-LSTM (Bi-directional Long Short-Term Memory). First, we adapt and transfer the VGG16 convolutional neural network to video action recognition, and apply data augmentation techniques such as flipping and cropping to boost performance. We then train the network and extract deep features, which are fed into a Bi-LSTM that models bi-directional temporal structure and provides additional context for earlier frames. Experimental results show that the bi-directional model clearly improves action recognition accuracy.

(2) Most existing methods build spatio-temporal models at a single temporal granularity and neglect the interplay of global and local information. We propose a multi-granularity feature description based on a multi-level adaptive temporal pyramid. First, we extract CNN features from the augmented convolutional network. Following the idea of a temporal pyramid, we split the video into segments of different time spans adaptively, according to the energy changes of the original video; this directs attention to dramatic actions of short duration. We then apply the Fourier transform to the features in each segment, and finally concatenate the features from the different temporal segments into the final representation. Experimental results show that this feature description captures multi-granularity information and effectively improves recognition of actions with dramatic motion.

(3) Existing methods cannot fuse multiple models effectively. We propose an adaptive fusion method. First, to remedy the loss of spatial location information in the two models above, we fuse the augmented convolutional network features with hand-crafted trajectory features. Then, to take full advantage of each model's merits, we assign weights to each model and action adaptively, and exploit the correlations among actions to guide the learning of the weight parameters. Experimental results on the UCF-101 and HMDB-51 datasets show that this method fuses multiple models effectively, and the fused model outperforms any single model.
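The bi-directional temporal modeling in contribution (1) can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: a plain tanh recurrence stands in for the LSTM cell, the weights are random illustrative values, and `frames` is a stand-in for per-frame CNN (e.g. VGG16) features. The point is only to show how forward and backward passes over the frame sequence are concatenated so that every frame's representation carries both past and future context.

```python
import numpy as np

def rnn_pass(frames, W_x, W_h, reverse=False):
    # frames: (T, D) per-frame CNN features; a tanh cell stands in for an LSTM cell
    T, _ = frames.shape
    H = W_h.shape[0]
    h = np.zeros(H)
    states = np.zeros((T, H))
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        h = np.tanh(frames[t] @ W_x + h @ W_h)
        states[t] = h  # hidden state aligned with frame t
    return states

rng = np.random.default_rng(0)
T, D, H = 8, 16, 4                                # 8 frames, 16-dim features, 4-dim hidden state
frames = rng.normal(size=(T, D))
W_x = rng.normal(size=(D, H)) * 0.1
W_h = rng.normal(size=(H, H)) * 0.1

fwd = rnn_pass(frames, W_x, W_h)                  # past -> future context
bwd = rnn_pass(frames, W_x, W_h, reverse=True)    # future -> past context
bi_feats = np.concatenate([fwd, bwd], axis=1)     # (T, 2H) bi-directional representation
print(bi_feats.shape)
```

In a full implementation the tanh cell would be replaced by an LSTM (e.g. a bidirectional recurrent layer in a deep learning framework) trained end-to-end on the extracted features.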
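The adaptive temporal pyramid of contribution (2) can likewise be sketched. The splitting rule below is an assumption for illustration: segments are chosen so that each carries roughly equal cumulative motion energy, so intervals with dramatic motion get shorter segments (finer granularity); each segment is then summarized by the magnitudes of its low-frequency Fourier coefficients and the per-segment descriptors are concatenated.

```python
import numpy as np

def adaptive_segments(energy, n_segments):
    # Split the timeline so each segment carries ~equal cumulative motion energy:
    # high-energy (dramatic) intervals yield shorter, finer-grained segments.
    cum = np.cumsum(energy) / energy.sum()
    inner = [int(np.searchsorted(cum, k / n_segments)) + 1
             for k in range(1, n_segments)]
    bounds = [0] + inner + [len(energy)]
    return list(zip(bounds[:-1], bounds[1:]))

def fourier_pool(feats, segments, n_coeffs=3):
    # Keep low-frequency Fourier coefficient magnitudes per segment, then concatenate.
    pooled = []
    for s, e in segments:
        spec = np.abs(np.fft.rfft(feats[s:e], axis=0))
        pooled.append(spec[:n_coeffs].ravel())
    return np.concatenate(pooled)

rng = np.random.default_rng(1)
T, D = 32, 8
feats = rng.normal(size=(T, D))                        # stand-in per-frame CNN features
energy = np.abs(np.diff(feats, axis=0)).sum(axis=1)    # frame-to-frame change as "energy"
energy = np.append(energy, energy[-1])                 # pad to length T
segs = adaptive_segments(energy, n_segments=4)
desc = fourier_pool(feats, segs)
print(segs, desc.shape)
```

One level of the pyramid is shown; a multi-level pyramid repeats this at several values of `n_segments` and concatenates all resulting descriptors.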
Keywords/Search Tags: action recognition, convolutional neural networks, recurrent neural networks, Fourier temporal pyramid, adaptive fusion