Research On Deep Learning-Based Video Action Recognition

Posted on: 2019-06-07
Degree: Master
Type: Thesis
Country: China
Candidate: S W Shi
Full Text: PDF
GTID: 2348330542469401
Subject: Engineering
Abstract/Summary:
Intelligent video monitoring uses computer vision and video image processing to analyze video sequences automatically. Video-based action recognition, the focus of intelligent video analysis, extracts salient visual features from a video sequence to describe motion patterns, then classifies and interprets them with machine learning and pattern recognition algorithms in order to recognize the behavior of the objects in the video. It is a high-level visual task and a challenging research topic in computer vision and pattern recognition. This thesis studies and proposes Spatiotemporal Residual Networks built on the Two-stream Convolutional Networks architecture, together with related algorithms and techniques, and discusses their application in practical engineering.

The thesis first studies Two-stream Convolutional Networks and Residual Networks. The former have shown strong performance for human action recognition in videos, while the latter have emerged as a technique for training extremely deep architectures. The thesis introduces both architectures in detail and then designs the Spatiotemporal Residual Networks architecture, which uses Two-stream Convolutional Networks as the baseline and initializes both streams with Residual Networks models pretrained on a large-scale image classification database. This makes full use of the massive amount of image training data for video action recognition.

The thesis studies cross-stream residual connections between the two streams of Spatiotemporal Residual Networks. The original Two-stream Convolutional Networks architecture lets the two independent streams interact only at the final stage, by fusing their respective softmax predictions, and therefore fails to learn truly spatiotemporal features. To address this drawback, the thesis explores several feasible cross-stream residual connections and compares and analyzes the connection methods experimentally in detail.

The thesis studies temporal residual connections injected into Spatiotemporal Residual Networks. The original two-stream network used only a small temporal window of 10 frames when making predictions, which were then averaged over the video, yet larger time intervals are more appropriate for many real-world actions. To give Spatiotemporal Residual Networks greater temporal support, the thesis proposes temporal filtering with feature identity: 1D temporal convolutions combined with feature space transformations initialized as identity mappings, which together form temporal residual connections. Several alternatives for injecting such temporal kernels into the network hierarchy are explored. First, different positions at which the temporal kernel is injected into the overall architecture are studied. Second, different initializations of the temporal filter kernels are designed, setting them to perform either averaging or centering in time, with the same initialization for all feature channels. Finally, temporal global max-pooling is proposed to increase the spatiotemporal receptive field so that long-term temporal relationships between features can be learned.

The thesis also studies Spatiotemporal Residual Networks with two asymmetric streams. Multitask learning on different datasets is achieved by replacing the two streams of the two-stream architecture with 50-layer and 152-layer Residual Networks models respectively. Based on the findings of the studies above, the thesis proposes a final Spatiotemporal Residual Networks architecture for action recognition in video, called Asymmetric Two-stream Multiplier Spatiotemporal ResNets. The whole model is trained end-to-end to learn complex spatiotemporal features hierarchically. On two widely used action recognition benchmarks, the proposed model reaches or exceeds the previous state of the art.
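
The temporal filtering described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the thesis implementation. The function names, the kernel width of 3, the concrete tap values (1/3 per tap for averaging, a single 1 at the centre for centering), and the zero-padding at the sequence ends are all assumptions made for illustration. The feature identity is modelled by filtering every channel independently with the same taps, so no mixing across channels takes place at initialization.

```python
def temporal_kernel(mode, width=3):
    """Return 1D temporal filter taps shared by all feature channels.

    Assumed tap values (not taken from the thesis):
      "averaging": every tap equals 1/width
      "centering": the centre tap is 1 and all others are 0
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the kernel has a centre tap")
    if mode == "averaging":
        return [1.0 / width] * width
    if mode == "centering":
        taps = [0.0] * width
        taps[width // 2] = 1.0
        return taps
    raise ValueError("mode must be 'averaging' or 'centering'")


def temporal_filter(features, taps):
    """Filter a sequence of per-frame feature vectors along time.

    features: list of T frames, each a list of C channel values.
    Each channel is filtered on its own (identity across channels),
    with zero-padding assumed at the start and end of the sequence.
    """
    num_frames = len(features)
    num_channels = len(features[0])
    half = len(taps) // 2
    output = []
    for t in range(num_frames):
        frame = []
        for c in range(num_channels):
            acc = 0.0
            for k, weight in enumerate(taps):
                source = t + k - half
                if 0 <= source < num_frames:
                    acc += weight * features[source][c]
            frame.append(acc)
        output.append(frame)
    return output


def temporal_global_max_pool(features):
    """Keep, for each channel, its maximum value over all frames."""
    num_channels = len(features[0])
    return [max(frame[c] for frame in features) for c in range(num_channels)]


if __name__ == "__main__":
    clip = [[1.0, 10.0], [2.0, 20.0], [3.0, 0.0]]  # 3 frames, 2 channels
    print(temporal_filter(clip, temporal_kernel("centering")))
    print(temporal_filter(clip, temporal_kernel("averaging")))
    print(temporal_global_max_pool(clip))
```

Under these assumptions, the centering initialization leaves the features unchanged at the start of training, whereas the averaging initialization smooths each channel over neighbouring frames from the outset. The max-pooling step collapses the time axis so that each channel summarizes the whole clip.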
Keywords/Search Tags: Intelligent video monitoring, Spatiotemporal Residual Networks, two asymmetric streams