
Research On Spatial-temporal Information For Action Recognition

Posted on: 2019-05-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y N Liu
Full Text: PDF
GTID: 1318330569487398
Subject: Signal and Information Processing
Abstract/Summary:
Video action recognition is one of the most active research fields in computer vision, with broad application prospects in intelligent video surveillance, human-robot interaction, abnormal human behavior detection, and video content analysis. The goal of action recognition research is to enable a computer to recognize, analyze, understand, and predict the behavior of targets in video, as a human does, by analyzing the video's spatial-temporal structure and its feature descriptions. However, owing to the complexity of video spatial-temporal structure and the diversity of video content, current research on video action recognition still faces difficulties such as how to efficiently extract video feature representations and how to accurately locate informative video regions. This dissertation studies how to utilize video spatial-temporal information to better describe video features and how to use video spatial-temporal relationships to extract those features. The research contents and innovations include the following aspects.

First, an action recognition method based on feature encoding of segmented video regions is proposed. The method first extracts local features from the video and encodes them with a bag-of-visual-words model. The video is then segmented into supervoxels, and the local features are pooled within each segmented region to form mid-level feature descriptions. These mid-level descriptions are encoded again with a bag-of-visual-words model and pooled over the whole video to form the video representation. To find regions of interest, the method combines the appearance, lighting, and motion saliency of the video with the discriminative power of each segmented region into a joint saliency, which determines the saliency of every region. Experimental results show that the proposed method is more accurate than local feature descriptions on multiple datasets.

Second, a multi-stream deep network for action recognition assisted by human gaze is proposed. An eye tracker first records subjects' eye movements while they watch videos, from which human gaze maps are constructed. A human gaze prediction model based on a fully convolutional network is then designed. To utilize the gaze information effectively, the method uses a multi-stream deep network that combines video appearance information, motion information, and human gaze information to recognize action categories. Experimental results show that the proposed method outperforms traditional deep network models on multiple datasets.

Third, a two-stream deep network for action recognition based on multi-temporal-scale stacked optical flow is proposed. The method has two advantages over the traditional motion network. First, the distribution of motion intensity in the video is analyzed, the video is segmented along the temporal dimension into regions of different importance, and regions with high motion intensity are preferentially used during training and testing. Second, the method constructs a temporal multi-scale motion network that takes optical flow from different temporal scales as input, providing richer motion information than a single temporal scale. Experimental results show that the proposed method outperforms the traditional two-stream deep network on multiple datasets. Minimal code sketches of these three methods follow.
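The first method's two-level encoding can be illustrated with a minimal Python sketch, assuming local descriptors and a supervoxel label for each descriptor have already been extracted; the array shapes, codebook sizes, and helper names below are illustrative assumptions, not values from the dissertation.

import numpy as np
from sklearn.cluster import KMeans

def bow_histogram(features, codebook):
    # Assign each feature to its nearest codeword and return an
    # L1-normalized histogram of codeword counts.
    if len(features) == 0:
        return np.zeros(codebook.n_clusters)
    words = codebook.predict(features)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

# 1) Local level: quantize local descriptors with a learned codebook.
local_feats = np.random.rand(5000, 64)           # e.g. local spatio-temporal descriptors
supervoxel_of = np.random.randint(0, 40, 5000)   # supervoxel id per descriptor
local_codebook = KMeans(n_clusters=128, n_init=4).fit(local_feats)

# 2) Mid level: pool local BoW histograms inside each supervoxel region
# to obtain one mid-level descriptor per segmented region.
mid_feats = np.stack([bow_histogram(local_feats[supervoxel_of == r], local_codebook)
                      for r in range(40)])

# 3) Video level: encode the mid-level descriptors with a second codebook
# and pool over the whole video to get the final representation.
mid_codebook = KMeans(n_clusters=32, n_init=4).fit(mid_feats)
video_repr = bow_histogram(mid_feats, mid_codebook)

In the dissertation's full method, each region's contribution would additionally be weighted by the joint saliency rather than pooled uniformly as above.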
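The second method's fusion of appearance, motion, and gaze streams can be sketched as a small PyTorch module with late fusion; the backbone, layer sizes, and input resolution are assumptions for illustration, not the architecture used in the dissertation.

import torch
import torch.nn as nn

class ThreeStreamNet(nn.Module):
    def __init__(self, num_classes, in_channels=(3, 2, 1)):
        super().__init__()
        def stream(c):
            # One small convolutional stream per modality.
            return nn.Sequential(
                nn.Conv2d(c, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Streams for the RGB frame, stacked optical flow (dx, dy),
        # and the predicted human gaze map.
        self.rgb, self.flow, self.gaze = (stream(c) for c in in_channels)
        self.classifier = nn.Linear(3 * 64, num_classes)

    def forward(self, rgb, flow, gaze):
        # Late fusion: concatenate per-stream features, then classify.
        fused = torch.cat([self.rgb(rgb), self.flow(flow), self.gaze(gaze)], dim=1)
        return self.classifier(fused)

net = ThreeStreamNet(num_classes=101)
logits = net(torch.randn(2, 3, 112, 112),   # appearance (RGB frame)
             torch.randn(2, 2, 112, 112),   # motion (optical flow)
             torch.randn(2, 1, 112, 112))   # human gaze map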
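The third method's multi-temporal-scale input can be sketched by stacking optical-flow fields over several temporal extents around the same center frame, with one motion stream per scale; the scale values and helper names are assumptions.

import numpy as np

def flow_stack(flows, center, length):
    # Stack `length` consecutive (dx, dy) flow fields around `center`
    # into one input volume of 2*length channels.
    start = max(0, min(center - length // 2, len(flows) - length))
    clip = flows[start:start + length]               # (length, H, W, 2)
    return clip.transpose(0, 3, 1, 2).reshape(-1, *clip.shape[1:3])

flows = np.random.randn(60, 128, 128, 2)             # per-frame flow fields
# One stacked input per temporal scale; each would feed a motion stream,
# and the streams' predictions are fused at test time.
scales = [5, 10, 20]
inputs = [flow_stack(flows, center=30, length=L) for L in scales]
for L, x in zip(scales, inputs):
    print(L, x.shape)    # (2*L, 128, 128)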
Fourth, a deep network for action anticipation based on self-supervised learning is proposed. Traditional deep learning requires a large amount of annotated data to train a network to learn video feature descriptions. The proposed method instead uses the correlations between different videos, and between a full video and its sub-videos, as supervision, so a deep network can be trained to learn video feature descriptions without any annotation by optimizing its parameters. Because the correlation between sub-videos and the full video is exploited during training, the model can be used for both action recognition and action anticipation. Experimental results show that the proposed method outperforms existing self-supervised learning methods for action anticipation on multiple datasets. A sketch of this self-supervised signal follows.
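The fourth method's supervision signal, the correlation between a sub-video and the full video it comes from, can be sketched as a contrastive objective in PyTorch; the InfoNCE-style loss form, temperature, and embedding size below are assumptions for illustration, not the dissertation's exact objective.

import torch
import torch.nn.functional as F

def subvideo_contrastive_loss(sub_emb, full_emb, temperature=0.1):
    # Each sub-video (observed prefix) embedding should match the
    # embedding of the full video it was cut from: positives sit on
    # the diagonal of the batch similarity matrix.
    sub = F.normalize(sub_emb, dim=1)
    full = F.normalize(full_emb, dim=1)
    logits = sub @ full.t() / temperature        # (B, B) similarities
    targets = torch.arange(sub.size(0))          # diagonal positives
    return F.cross_entropy(logits, targets)

# In practice both embeddings come from a shared video encoder applied
# to the observed prefix and to the complete clip; random tensors stand
# in for them here.
sub = torch.randn(8, 256, requires_grad=True)
loss = subvideo_contrastive_loss(sub, torch.randn(8, 256))
loss.backward()

Because the encoder learns to relate a partial observation to its full video, the same representation supports anticipating an action from only its beginning.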
Keywords/Search Tags: Action Recognition, Video Motion Information, Self-supervised Learning, Deep Neural Networks, Human Gaze Detection