
Video Human Action Recognition Based On Spatial-temporal CNN Framework

Posted on: 2020-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: H Yu
Full Text: PDF
GTID: 2428330596471426
Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the era of big data, applications in fields such as video surveillance, behavior analysis, and video retrieval have increasingly clear prospects. However, human action recognition still faces many challenges, arising both from the diversity of videos, such as complex background environments and changes of spatial viewpoint, and from the complexity of human actions themselves, such as intra-class and inter-class differences between actions. To address these problems, this dissertation proposes a human action recognition algorithm that fuses improved temporal and spatial convolutional neural networks (spatial-temporal CNN, STCNN) to capture both the spatial and the temporal features of human actions in videos; the temporal and spatial streams both use the same improved CNN framework. The main research work is as follows:

First, the optical-flow information of human motion extracted by the improved dense trajectories (iDT) algorithm is used as the input of the temporal CNN stream. During feature extraction through the temporal and spatial networks, the spatial stream takes the RGB image data as input, while the temporal stream, combined with the iDT algorithm, takes the data of consecutive optical-flow fields as input. A traditional optical-flow algorithm extracts the trajectories of human motion across multiple video frames; Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and Motion Boundary Histogram (MBH) features are then computed around each trajectory, locating the salient regions of human action so that the motion trajectories can be tracked.
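The HOF descriptor mentioned above quantizes optical-flow vectors into an orientation histogram weighted by flow magnitude. The thesis does not give implementation details, so the following is a minimal NumPy sketch of that idea; the function name and the bin count of 8 are illustrative assumptions, not from the original work.

```python
import numpy as np

def flow_histogram(flow, n_bins=8):
    """HOF-style descriptor: bin flow vectors by orientation, weighted by magnitude.

    flow : array of shape (H, W, 2) holding (dx, dy) per pixel.
    Returns an L1-normalized histogram of length n_bins.
    """
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)                            # flow magnitude per pixel
    ang = np.arctan2(dy, dx) % (2 * np.pi)            # orientation in [0, 2*pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins, weights=mag, minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Example: a flow field where every vector points right falls entirely into bin 0.
uniform_right = np.zeros((4, 4, 2))
uniform_right[..., 0] = 1.0
h = flow_histogram(uniform_right)  # → bin 0 holds all the mass
```

In the iDT pipeline such histograms are computed in small spatio-temporal cells around each tracked trajectory and concatenated, rather than over the whole frame as in this sketch.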
Second, a spatial transformer network (STN) module based on affine transformations is added to the traditional CNN framework. A traditional CNN lacks spatial invariance when processing input data: its pooling mechanism provides no spatial transformation ability, such as translation or rotation, for recognizing the target as a whole, and only small transformations are absorbed when processing deep feature maps. Therefore, the STN is merged into the CNN framework, and the input video frames are spatially transformed by the STN through a series of operations such as rotation and scaling. After the first convolutional layer, five STN structures are placed, extracting five regions of interest corresponding to different parts of the human body. The transformed feature maps then undergo high-level feature extraction through the subsequent convolutional and pooling layers.

Experiments are carried out on the standard human action datasets HMDB51 and UCF101. The results show that the recognition rate of the algorithm is 90.5% on the HMDB51 dataset and 62.2% on the UCF101 dataset, and that the proposed algorithm achieves better recognition results than other human action recognition algorithms.
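The STN described above follows the standard spatial-transformer design: a small localization network predicts six affine parameters, which are turned into a sampling grid that warps the input differentiably. The thesis does not specify layer sizes, so the sketch below is a minimal single-module PyTorch illustration with assumed dimensions (64×64 input); the last layer is initialized to the identity transform so that training starts from an unwarped image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Minimal spatial transformer: localization net -> affine grid -> sampling."""

    def __init__(self):
        super().__init__()
        # Localization network: regresses features used to predict the transform.
        self.loc = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        # For a 3x64x64 input the localization output is 10x12x12 = 1440 features.
        self.fc = nn.Sequential(
            nn.Linear(10 * 12 * 12, 32), nn.ReLU(), nn.Linear(32, 6),
        )
        # Initialize the final layer to the identity affine transform.
        self.fc[-1].weight.data.zero_()
        self.fc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float)
        )

    def forward(self, x):
        theta = self.fc(self.loc(x).flatten(1)).view(-1, 2, 3)
        # Build a sampling grid from the predicted affine matrix and warp x.
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

The dissertation places five such modules after the first convolutional layer, one per body-part region of interest; the sketch shows only one, applied directly to the input frame for clarity.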
Keywords/Search Tags: Human action recognition, convolutional neural network, spatial affine transformation, dense trajectory algorithm