
Research On Human Action Recognition Based On Two-stream Fusion Convolutional Neural Network

Posted on: 2019-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: L Q Xue
GTID: 2348330542993910
Subject: Circuits and Systems
Abstract/Summary:
In recent years, video and images have become important information carriers on the Internet. At the same time, video surveillance is playing an increasingly important role in securing public places. How to use computer vision to automatically identify and analyze human behavior in video has therefore become a research hotspot. In traditional human action recognition methods, features must be designed by hand, and the recognition results depend strongly on the experience of the feature designers. The emergence of deep convolutional neural networks frees researchers from complex feature engineering, because the features are learned automatically from the raw data.

This thesis takes the two-stream convolutional neural network as its theoretical basis and the Temporal Segment Network (TSN) model as its framework, and focuses on the structural design of the two-stream network and on when and how the two streams are fused. The two-stream network comprises two recognition streams, spatial and temporal, which share the same structure and extract the static and motion information of the video, respectively. The inputs to the network are RGB images and dense optical flow extracted from the original video. Other input forms, such as flow MHI and dynamic images, are also tried. The thesis adopts the BN-Inception network structure. Its Inception modules stack convolution kernels of several sizes together with pooling layers, which improves the network's ability to process visual information at various scales, while batch normalization reduces internal covariate shift and accelerates training.

During training, the network is fine-tuned step by step. The temporal and spatial networks are first trained separately on the UCF101 dataset, so that each single-stream network can effectively extract the motion or the static information, respectively. The parameters of the two trained networks are then used to initialize the two-stream fusion network, and only the fusion layer is fine-tuned. Four fusion methods are considered: summation fusion, maximum-value fusion, convolutional fusion, and hybrid fusion.

Finally, the performance of these methods is evaluated on the UCF101 dataset. The classification results of some videos are taken as examples to analyze the classification characteristics of the spatial and temporal networks. The accuracies of the fusion methods and of the single-stream networks are compared, which verifies the advantage of network fusion. The average accuracy of hybrid fusion reaches 92.73%, verifying that two-stream convolutional fusion networks can effectively fuse the information extracted by the single streams. Compared with fusing the output results of both streams, this method is more efficient and is end-to-end.
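
As a rough illustration of the fusion methods named above, the following NumPy sketch shows one common way to define summation, maximum-value, and convolutional fusion for two feature maps of the same shape. It is not the thesis's implementation: the function names, the (channels, height, width) layout, and the use of a 1x1 convolution for convolutional fusion are all assumptions made for this example. Hybrid fusion is left out because the abstract does not say how it is constructed.

```python
import numpy as np

# Hypothetical helpers for illustration only; names and shapes are assumptions.
# Each feature map has shape (C, H, W): channels, height, width.

def sum_fusion(spatial, temporal):
    """Element-wise sum of the two streams' feature maps."""
    return spatial + temporal

def max_fusion(spatial, temporal):
    """Element-wise maximum of the two streams' feature maps."""
    return np.maximum(spatial, temporal)

def conv_fusion(spatial, temporal, weights, bias):
    """Stack the streams along the channel axis, then apply a 1x1 convolution.

    weights has shape (C_out, 2 * C) and bias has shape (C_out,).
    A 1x1 convolution is a per-pixel linear map over channels,
    so it can be written as a single einsum.
    """
    stacked = np.concatenate([spatial, temporal], axis=0)  # (2C, H, W)
    return np.einsum("oc,chw->ohw", weights, stacked) + bias[:, None, None]

# Example usage with random feature maps (C=4, H=W=7).
rng = np.random.default_rng(0)
spatial = rng.standard_normal((4, 7, 7))
temporal = rng.standard_normal((4, 7, 7))
weights = rng.standard_normal((4, 8)) * 0.1
bias = np.zeros(4)

fused = conv_fusion(spatial, temporal, weights, bias)
print(fused.shape)
```

In this sketch only the convolutional variant has trainable parameters (weights and bias), which fits the abstract's description of fine-tuning just the fusion layer after the two streams are initialized from separately trained networks.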
Keywords/Search Tags: Action Recognition, Deep Learning, Two-Stream Convnets, Network Fusion, UCF101 Dataset