
Research On Spatio-Temporal Interest Point Based Human Action Recognition

Posted on: 2019-12-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Lin
GTID: 1368330566978006
Subject: Computer application technology
Abstract/Summary:
Human action recognition has been a research hotspot in computer vision in recent years and is widely used in intelligent video surveillance, motion analysis, virtual reality, human-computer interaction and video retrieval. This work presents an in-depth, systematic study of human action recognition based on spatio-temporal interest points. It covers the detection algorithm for local spatio-temporal interest points, the descriptor extraction algorithm for those points, the feature encoding method for representing human action videos, and the classification algorithm for human action recognition. The main contributions and innovations are as follows:

1) To address the redundant spatio-temporal interest points that are detected in the background and in non-dominant motion regions under complex real-world conditions, a spatio-temporal interest point selection algorithm based on space-time suppression response is proposed. By calculating the relative spatio-temporal gradients between each candidate interest point and its adjacent points in a reference space-time volume, the algorithm measures the spatio-temporal suppression strength and the response intensity under this suppression. Interest points are then selected by non-maximum suppression according to the magnitude of the suppressed response. The algorithm is evaluated on multiple human action datasets. Compared with the main spatio-temporal detectors, such as 3D Harris, Cuboid and Hessian, the proposed method maintains detection performance under simple environmental conditions. Under complex real-world conditions, it effectively eliminates a large number of redundant interest points detected on backgrounds and non-dominant motion regions, and the detection accuracy is improved. Finally, comparative human action recognition experiments show that the proposed detection algorithm improves classification accuracy to some extent.

2) To address the insufficient robustness of local spatio-temporal interest point descriptors to occlusion, noise, local transformation and camera motion, a new histogram of oriented gradients (HOG) local descriptor extraction method based on spatio-temporal three-dimensional scattering transform coefficients is proposed. The scattering transform is an image transform based on directional wavelet transforms and scale smoothing. It offers local translation invariance, rotation invariance and stability to elastic deformation for local texture features, and it is robust to local occlusion and noise. In this work, the scattering transform is extended to three-dimensional spatio-temporal space, and a transformation method for video spatio-temporal volumes is built on it. The video spatio-temporal volume is reconstructed from the transformed scattering coefficients to obtain a local texture spatio-temporal volume. At the location of each spatio-temporal interest point, HOG features are extracted from the neighborhood volume within the texture spatio-temporal volume. The final descriptor fuses these local features with an approximated spatio-temporal pyramid model. Comparative human action recognition experiments on multiple datasets show that the proposed STC-HOG descriptor outperforms most of the main local descriptor algorithms in describing spatio-temporal interest points. Compared with the combined HOG, HOF and MBH feature, the proposed local descriptor achieves comparable results.

3) To address the loss of high-order feature distribution information when encoding with the vector of locally aggregated descriptors (VLAD), a histogram of distribution vector of locally aggregated descriptors (HOD-VLAD) based on a Gaussian kernel is proposed. Standard VLAD contains only first-order statistics (the distribution mean) of the feature space; HOD-VLAD additionally integrates zero-order statistics (the histogram of distribution) and second-order statistics (the standard deviation). First, for each feature dimension in every cluster, a Gaussian statistical model is constructed with the cluster center as the mean and the eigenvalues of the covariance matrix as the variance, and an encoding vector carrying high-order statistical information is generated by the improved VLAD encoding function. Then, the distribution model of each feature dimension in every cluster is quantized and the histogram of distribution is constructed. The final HOD-VLAD encoding is the concatenation of the histogram of distribution and the improved encoding vector. The proposed encoding is applied to human action recognition based on spatio-temporal interest points: it encodes the local features extracted at the interest points and builds a coded representation of human action videos. In comparison experiments on multiple human action datasets, HOD-VLAD obtained better recognition performance: its average recognition accuracy exceeded that of BOVW and standard VLAD, and was slightly lower than that of FV encoding. Its computational resource consumption is significantly lower than that of FV encoding. Overall, HOD-VLAD achieves a good balance between encoding performance and computational cost.

4) A human action classification method based on l1-norm error evaluation is proposed. Following the idea of Robust PCA (RPCA), the video representation features are decomposed into a linear combination of a feature essence and a noise signal. Since the l1-norm is less sensitive than the l2-norm to outlier samples in the model expression, the solution of the optimization objective can be sparse. Therefore, following the sparse representation classification approach, the human action classification problem is transformed into a linear optimization problem under an l1-norm constraint, and the K-SVD method is used to construct a sparse over-complete dictionary for each action category. The sparse over-complete dictionary of the whole video feature space is the concatenation of these per-category dictionaries. The final class is the one with the minimum cumulative reconstruction residual, computed by sparse representation over all features in the video sequence. Comparative experiments on multiple human action datasets show that the proposed classification algorithm achieves good classification results and is robust to noise in complex real-world environments.
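
The decision rule in contribution 4 (code every feature of a video over the concatenated per-category dictionaries, then pick the category with the smallest cumulative reconstruction residual) can be illustrated with a minimal sketch. This is not the dissertation's implementation. It assumes the per-category dictionaries are already given (no K-SVD training is shown), and it substitutes a standard l1-regularized least-squares objective solved by ISTA for the exact l1-norm error formulation, which the abstract does not spell out. The names `sparse_code_ista`, `classify_video` and the parameter `lam` are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sparse_code_ista(D, y, lam=0.05, n_iter=200):
    """Approximately solve min_x 0.5 * ||y - D x||_2^2 + lam * ||x||_1 with ISTA.

    Assumption: this lasso-style objective stands in for the abstract's
    l1-norm formulation, whose exact form is not given.
    """
    lipschitz = np.linalg.norm(D, 2) ** 2  # squared largest singular value of D
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)
        x = soft_threshold(x - grad / lipschitz, lam / lipschitz)
    return x


def classify_video(features, class_dicts, lam=0.05):
    """Return (label, residuals) for one video.

    features:    array of shape (n_features, dim), one row per local feature
    class_dicts: list of arrays of shape (dim, n_atoms_c), one per category
                 (assumed pre-trained; K-SVD is not implemented here)
    """
    D = np.hstack(class_dicts)  # concatenated over-complete dictionary
    offsets = np.cumsum([0] + [Dc.shape[1] for Dc in class_dicts])
    residuals = np.zeros(len(class_dicts))
    for y in features:
        x = sparse_code_ista(D, y, lam)
        for c, Dc in enumerate(class_dicts):
            xc = x[offsets[c]:offsets[c + 1]]  # coefficients of category c only
            residuals[c] += np.linalg.norm(y - Dc @ xc)
    return int(np.argmin(residuals)), residuals


if __name__ == "__main__":
    # Toy data: category 0 spans the first two axes, category 1 the last two.
    D0, D1 = np.eye(4)[:, :2], np.eye(4)[:, 2:]
    video = np.array([[1.0, 0.5, 0.05, 0.0],
                      [0.2, 0.9, 0.0, 0.1]])
    label, res = classify_video(video, [D0, D1])
    print(label, res)
```

Accumulating residuals over all features before taking the minimum lets a few poorly reconstructed features be outvoted by the rest of the video, which is one plausible reading of why the abstract sums residuals rather than classifying each feature separately.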
Keywords/Search Tags:Spatio-temporal interest points, Human action recognition, Scattering Transform, Feature encoding