With the rapid development of Internet technology, a large amount of multimedia data has been generated, and video understanding and action recognition in real-time surveillance have gradually become popular research directions with broad application prospects in many fields. In recent years, deep neural networks have achieved superior performance on many visual recognition tasks, but such models rely heavily on manually labeled datasets, which are costly in both labor and time to obtain. In contrast, large amounts of unlabeled data are readily available on the Internet, and unsupervised learning from such data has attracted considerable attention from researchers; it is therefore worth investigating in depth how to exploit unlabeled data to improve the performance of video action recognition. In this paper, we study self-supervised learning methods based on deep neural networks from the multimodal perspective of video data, mining supervisory signals from unlabeled data itself and learning representations that are useful for downstream tasks, so as to build an efficient human action recognition model. The main research includes the following two aspects:

(1) Cross-modal temporal contrastive learning for self-supervised action recognition

To address the limited performance of video feature modeling in fine-grained scenes, and considering the temporal continuity of video action sequences and the semantic relevance of multimodal information, this paper proposes a self-supervised algorithm for cross-modal temporal contrastive learning (CMTCL). A local temporal contrastive learning method is designed that adopts different positive- and negative-sample division strategies to explore the temporal correlation and discriminability between non-overlapping segments of the same instance, enhancing fine-grained feature expression; a global contrastive learning method is studied that increases the number of positive samples through cross-modal semantic co-training, learning the semantic consistency of different views of multiple instances and improving the generalization capability of the model. Extensive experiments on two publicly available action recognition datasets, UCF101 and HMDB51, show that the proposed method improves on average by 2% to 3.5% over state-of-the-art mainstream methods.

(2) Cross-view consistency mining for self-supervised skeleton action recognition

To address the problem that the deep feature expression of a single skeleton-sequence view is semantically limited, and considering the information consistency across multiple skeleton views, this paper proposes a self-supervised algorithm for cross-view consistency mining (CVSCL). Multiple skeleton augmentation methods are combined to generate positive sample pairs for contrastive learning, increasing the spatio-temporal diversity of skeleton sequences and improving the generalization of single-view representations; building on the prior knowledge captured by single-view skeleton representations, a cross-view consistency mining method is investigated that mines hard positive examples through correlation constraints between views and learns a cooperative representation of multiple views. Experimental results show that the proposed method effectively improves action recognition accuracy on the NTU RGB+D 60/120 datasets under unlabeled settings.
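Both contributions build on an instance-discrimination contrastive objective, in which an anchor embedding is pulled toward its positive (another view, modality, or augmentation of the same instance) and pushed away from negatives drawn from other instances. As a rough illustration of that shared idea only (not the thesis's actual CMTCL/CVSCL losses), the following is a minimal NumPy sketch of a standard InfoNCE-style loss; the function name, batch size, and temperature are illustrative assumptions:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Illustrative InfoNCE loss: the positive for each anchor is the
    same-index row of `positives`; all other rows act as negatives."""
    # L2-normalize embeddings so dot products become cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Cross-entropy with the diagonal (matched pairs) as the target class.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy check: perfectly aligned anchor/positive pairs should incur a
# lower loss than randomly paired embeddings.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = info_nce_loss(z, z)
loss_random = info_nce_loss(z, rng.normal(size=(8, 16)))
print(loss_aligned, loss_random)
```

In the methods summarized above, the positives would come from cross-modal views or skeleton augmentations of the same action instance rather than from identical embeddings as in this toy check.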