With the rapid development of cloud computing and edge computing, people have become accustomed to living under the surveillance of cameras and to participating in video-sharing social services. The abundance of video data has attracted more and more researchers to video understanding, and action recognition is a fundamental problem in this field, with a wide range of applications in surveillance systems, human-computer interaction, and video retrieval. Thanks to deep learning techniques and large-scale labeled datasets, research on supervised action recognition has progressed rapidly in recent years. However, owing to factors such as privacy and ethics, collection costs, and labeling costs, it is difficult to obtain sufficient samples for some action categories, which leads to a scalability problem for traditional methods; few-shot action recognition is therefore of great practical relevance. The task is still at an early stage of research, with great development potential, but it also faces significant challenges. To this end, this paper proposes an efficient video spatio-temporal feature extraction method and designs action recognition models based on two few-shot learning approaches, prototype networks and data augmentation. The main research content is as follows:

(1) Efficient understanding of fine-grained spatio-temporal information in videos is pivotal for few-shot action recognition. This paper proposes an efficient spatio-temporal feature extraction unit based on spatio-temporal separation and long-range temporal modeling, which is embedded into a ResNet backbone to enhance its motion feature modeling capability. First, the spatio-temporal dynamic gating module approximates motion saliency by the magnitude of feature differences between adjacent frames and uses this difference as a gating vector to separate features into motion-salient features and motion-weak features, which are fed to temporal and spatial modeling, respectively. Then, the temporal attention aggregation module groups the motion-salient features along the channel dimension, constructs a temporal pyramid to capture temporal features of different spans, and uses an attention mechanism to aggregate the grouped features, achieving long-range temporal modeling. Experiments show that the method effectively improves the temporal modeling capability for videos and increases action recognition accuracy.
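As a concrete illustration of the gating idea, the following PyTorch code is a minimal sketch under my own assumptions (a 1x1 convolution producing the gate, placeholder temporal and spatial branches, and the hypothetical name SpatioTemporalDynamicGate); it is not the implementation of this paper, only a demonstration of splitting features into motion-salient and motion-weak parts via adjacent-frame differences:

```python
# Hypothetical sketch (not the thesis implementation) of the spatio-temporal
# dynamic gating idea: adjacent-frame feature differences approximate motion
# saliency and act as a gate that routes features to temporal or spatial modeling.
import torch
import torch.nn as nn


class SpatioTemporalDynamicGate(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution turning frame differences into per-channel gate logits;
        # the exact gating network is an assumption.
        self.gate_conv = nn.Conv2d(channels, channels, kernel_size=1)
        # Placeholder temporal / spatial branches; the paper uses its own designs.
        self.temporal_branch = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                         padding=(1, 0, 0))
        self.spatial_branch = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                        padding=(0, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Adjacent-frame differences; repeat the last difference so T is unchanged.
        diff = x[:, :, 1:] - x[:, :, :-1]
        diff = torch.cat([diff, diff[:, :, -1:]], dim=2)
        # Per-frame gate computed from the magnitude of the differences.
        logits = self.gate_conv(
            diff.abs().permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        )
        gate = torch.sigmoid(logits).reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        motion_salient = gate * x         # routed to temporal modeling
        motion_weak = (1.0 - gate) * x    # routed to spatial modeling
        return self.temporal_branch(motion_salient) + self.spatial_branch(motion_weak)


if __name__ == "__main__":
    clip = torch.randn(2, 64, 8, 56, 56)  # (batch, channels, frames, H, W)
    print(SpatioTemporalDynamicGate(64)(clip).shape)  # torch.Size([2, 64, 8, 56, 56])
```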
(2) To address the task-irrelevance problem caused by the meta-task learning format of prototype networks, as well as the problems of outlier samples within classes and overlapping distributions between classes in the support set, this paper proposes a few-shot action recognition method based on task relevance and distribution calibration. First, the task-aware learning module captures the internal relationships between the samples of a task and uses these relationships to derive, for each sample, a feature representation focused on the specific task. Then, the category-aware calibration module examines the positional distribution of each class's sample features in the metric space and recalculates the prototype representation to mitigate the effect of outlier samples within classes; in addition, it optimizes the distances between all classes and improves inter-class separation to alleviate the inter-class overlap problem. Experimental results on the relevant datasets show that the proposed method outperforms other recently proposed methods and significantly improves the accuracy of few-shot action recognition.

(3) To address the problems that existing data augmentation methods cannot synthesize discriminative visual features with a GAN and cannot effectively transfer information from base classes to novel classes, this paper proposes a data synthesis and knowledge-driven action recognition method based on few-shot learning. First, the cross-modal visual feature generator uses the semantic information of class labels as a conditioning signal for discriminative information mining, synthesizing discriminative visual features of novel classes. Then, the knowledge-driven action classifier uses external knowledge to construct a knowledge relation graph that represents the relationships between action classes, and applies a graph convolutional network to optimize the node relationships in this graph, forming an action-node classifier. Comparison and analysis with other data augmentation methods show that the proposed method performs better in both standard few-shot learning and generalized few-shot learning.
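To illustrate the knowledge-driven classifier, the PyTorch sketch below shows how graph-convolution node outputs can act as per-class classifier weights. The class names (KnowledgeDrivenClassifier, GraphConvLayer), layer sizes, and the toy self-loop graph are my own assumptions for demonstration, not the implementation of this paper:

```python
# Hypothetical sketch of the knowledge-driven classifier idea: graph convolutions
# propagate label semantics over an action knowledge-relation graph, and each
# node's output vector serves as the classifier weight for its action class.
import torch
import torch.nn as nn


class GraphConvLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, activate: bool = True):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.activate = activate

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Symmetric normalization of the adjacency matrix (self-loops assumed included).
        norm = adj.sum(dim=-1).clamp(min=1e-6).rsqrt()
        adj_norm = norm.unsqueeze(1) * adj * norm.unsqueeze(0)
        out = self.linear(adj_norm @ node_feats)
        return torch.relu(out) if self.activate else out


class KnowledgeDrivenClassifier(nn.Module):
    def __init__(self, word_dim: int, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.gc1 = GraphConvLayer(word_dim, hidden_dim)
        self.gc2 = GraphConvLayer(hidden_dim, feat_dim, activate=False)

    def forward(self, label_embeddings, adj, visual_feats):
        # Node inputs are semantic label embeddings; node outputs act as class weights.
        class_weights = self.gc2(self.gc1(label_embeddings, adj), adj)  # (C, feat_dim)
        return visual_feats @ class_weights.t()                         # (N, C) logits


if __name__ == "__main__":
    num_classes, word_dim, feat_dim = 10, 300, 512
    adj = torch.eye(num_classes)                  # toy relation graph: self-loops only
    labels = torch.randn(num_classes, word_dim)   # e.g. word vectors of class names
    videos = torch.randn(4, feat_dim)             # visual features of four videos
    print(KnowledgeDrivenClassifier(word_dim, feat_dim)(labels, adj, videos).shape)
```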