
Research On Methods Of Spatiotemporal Feature Modeling Based Activity Recognition

Posted on: 2021-10-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Xue
GTID: 1488306311971289
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the rapid development of Internet technology and the spread of mobile imaging equipment, videos have become an important way for people to obtain information in daily life. Activity recognition is one of the central research topics in computer vision and video processing, and it is the basis of video analysis and understanding. It has been widely used in intelligent monitoring, video retrieval, military reconnaissance, human-computer interaction and unmanned driving.

Traditional activity recognition methods mainly rely on hand-crafted features to describe the activity in videos, and they suffer from poor applicability and robustness. In recent years, as computing power has improved, modeling methods based on deep features have received more and more attention in the field of activity recognition. Deep features are learned adaptively by deep neural networks while modeling the human activity in videos. They apply to a wider range of scenarios, are more robust, and have become the most effective approach in the field. However, some key problems in modeling with deep neural networks remain to be solved, such as the insufficient ability of the network to represent the activity, the excessive reliance on labeled data for pre-training, and the poor real-time performance of recognition. This dissertation studies spatiotemporal feature modeling methods for human activity recognition based on deep neural networks, and improves the performance of human activity recognition. The main research results are as follows:

1. The network's insufficient ability to represent the activity is studied. When recognizing human activity, the most discriminative parts are usually sparsely distributed over different moments and regions of the video. If the network treats all frames indiscriminately, noise is introduced and the features represent the activity less well. This dissertation proposes an activity recognition algorithm based on spatial-temporal attention. The algorithm extracts the static and dynamic features of the video with a two-stream convolutional neural network and models these features with a hierarchical convolutional long short-term memory network. Spatial-temporal attention then guides the network to pay more attention to important moments and regions during modeling, which enhances the network's ability to represent the activity and improves recognition performance.

2. The excessive dependence of network pre-training on labeled data is studied. In activity recognition, large-scale annotated datasets are used to pre-train the network, but labeled data is relatively expensive to obtain. Meanwhile, a large number of easily accessible unlabeled videos exist on the Internet. To pre-train the network with unlabeled videos, this dissertation proposes a self-supervised learning algorithm based on mutual information maximization. The algorithm first guides the network to learn the connection between different clips of the same video by maximizing the mutual information between them. Then, to prevent the network from attending only to the video background when maximizing the mutual information between clips, two methods are proposed according to the characteristics of 2D-CNN and 3D-CNN: maximizing the motion mutual information and maximizing the local mutual information, respectively. Finally, mutual information maximization is used to complete the pre-training of the 2D-CNN and the 3D-CNN. The algorithm improves the recognition performance of the network and reduces the dependence of network pre-training on labeled data.

3. The poor universality of self-supervised learning algorithms is studied. The self-supervised learning algorithm based on mutual information maximization applies different maximization methods to 2D-CNN and 3D-CNN, so it adapts poorly to different types of networks. Existing self-supervised learning algorithms share this problem. This dissertation proposes a self-supervised learning algorithm based on video pseudo labels. The algorithm first extracts features from the different modalities of the video and uses all the extracted features to construct a feature set. It then clusters the features in the feature set and uses the clustering results to generate video pseudo labels. Finally, the generated pseudo labels are used to train the network on the different modal inputs simultaneously, guiding the network to learn the correspondence between modalities. In addition, to prevent the trivial solution that can arise when clustering and classification are trained jointly, a feature constraint method based on the Siamese network is adopted during feature set construction. This algorithm can be applied to both 2D-CNN and 3D-CNN, and it reduces the dependence of network pre-training on labeled data.

4. The poor real-time performance of network recognition is studied. To improve activity recognition performance, the static information of the RGB image and the dynamic information of the optical flow are usually both used for modeling. However, computing optical flow takes a long time, which leads to poor real-time performance. This dissertation proposes a fast activity recognition algorithm based on refined motion vectors. The algorithm first extracts the motion vectors from the compressed video and refines them with DCT coefficients. The refined motion vectors are then used as the network input in place of optical flow, which avoids the time-consuming optical flow computation. Finally, the algorithm uses the lightweight network ShuffleNet V2 to construct the two-stream model, further reducing the model's memory footprint and improving computational efficiency.
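
The mutual information objective described in the second contribution can be illustrated with a small sketch. The abstract does not say which estimator the dissertation uses, so the code below assumes the InfoNCE contrastive loss, a commonly used lower bound on mutual information: features of two clips from the same video form a positive pair, and clips from other videos in the batch act as negatives. The function names, the temperature value and the toy feature vectors are all hypothetical and are not taken from the dissertation.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] are assumed to be features of two different
    clips from video i; positives[j] with j != i serve as negatives.
    """
    a = [l2_normalize(x) for x in anchors]
    p = [l2_normalize(x) for x in positives]
    n = len(a)
    total = 0.0
    for i in range(n):
        # Cosine similarity of anchor i to every candidate clip in the batch.
        logits = [dot(a[i], p[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching clip as the correct class.
        total += log_sum - logits[i]
    return total / n


def mi_lower_bound(loss, batch_size):
    """InfoNCE bound: I(anchor; positive) >= log(batch_size) - loss."""
    return math.log(batch_size) - loss


# Toy features (hypothetical): three videos, two clips each.
clip_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
clip_b = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]

aligned_loss = info_nce(clip_a, clip_b)
# Pair each anchor with a clip from a different video.
shuffled_loss = info_nce(clip_a, clip_b[1:] + clip_b[:1])
```

Minimizing this loss pushes features of clips from the same video together and features of clips from different videos apart, which raises the lower bound returned by `mi_lower_bound`. In this toy batch the correctly paired clips give a much smaller loss than the mismatched pairing.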
Keywords/Search Tags:Activity recognition, Spatiotemporal features, Network transfer, Unlabeled data, Self-supervised learning, Real-time, Motion vector