| In computer vision, human action recognition refers to technology that uses artificial intelligence algorithms to give machines, which lack autonomous consciousness, a degree of brain-like reasoning so that they can understand the meaning of human behaviors. In recent years, it has been widely applied in everyday scenarios such as human-computer interaction, video surveillance, and medical insurance, attracting growing attention from researchers and becoming a major research focus in computer vision. Although action recognition technology has developed rapidly and made great progress, accurately identifying action categories in real scenes still poses several challenges, mainly pose diversity caused by individual differences, the extraction of temporal dependencies, and the limited expressiveness of a globally fixed skeleton graph structure. To address these problems, this work starts from the spatio-temporal structural characteristics of behavioral sequence data and the limitations of existing recognition models, and studies human action recognition based on spatio-temporal dynamic representation, aiming to reduce the differences between actions of the same type and increase the discrimination between different action types. First, a graph structure is introduced to represent human pose data, and an appropriate spatial difference descriptor is constructed from its spatial distribution information and structural characteristics to improve the robustness of the feature representation to spatial pose changes. Second, based on the graph structure representation, a graph convolution operation built on the connection relationships of the spatial pose joints and a vector-form difference weight parameter are designed, yielding a stackable deep network structure and an end-to-end working mode
from data input to prediction output. Then, a temporal graph representation structure is designed around the temporal dependencies, so as to preserve the inherent pattern of time-series variation and improve the feature representation of spatio-temporal sequence data. Finally, based on the distribution characteristics of the sample data, a dynamic representation module that adaptively learns both global and local graph structure features is designed and implemented to improve the model’s adaptive dynamic learning and expressive ability. Specifically, the main contributions of this work can be summarized in the following four aspects: 1. A human action recognition method based on spatial difference descriptors is proposed. First, by introducing a graph structure representation, our method represents the spatial structural characteristics of the human pose while retaining channel difference information; then, according to the spatial distribution information and structural characteristics of the joints of the human pose, we use second-order statistical operations to construct a suitable spatial difference descriptor that improves the robustness of the feature representation to spatial pose changes and removes the constraints imposed by the diverse spatial distribution of sample data. To verify the recognition performance of the proposed model, experiments were carried out on three standard action recognition datasets, and its recognition accuracy exceeded that of the compared methods, which also indicates that extracting spatial difference information helps reduce intra-class differences and increase inter-class differences, thereby improving the recognition performance of the model. 2. A spatial graph convolutional network based on the spatial difference descriptor is proposed, which realizes an end-to-end working mode from data input to prediction output by designing a
general deep learning structure. First, starting from the graph structure representation of behavior sequence data, a graph convolution operation based on the connection relationships of spatial pose joints is designed to extract human spatial pose features; then, according to the structural characteristics of the same functional module layer in the deep neural network, a vector-form difference weight parameter is designed to optimize the learned spatial difference feature representation of human poses; finally, for the temporal dependencies contained in action sequences, a CNN structure is applied directly to the input features, using the inherent time-order information to compute the temporal dependencies, which parallelizes data processing and speeds up network training. Experimental results on three standard datasets show that the proposed method achieves better recognition results, which also verifies the effectiveness of the model structure. 3. Building on the spatial graph convolution structure, we further explore temporal dependencies in the time domain and implement a temporal graph representation structure that better extracts changes in temporal dependency. Compared with other topological representations, the local partition strategy defined when constructing the temporal graph structure better reflects the differential influence of weight mapping, and the corresponding operation structure is more stable and controllable, enhancing the differential expression ability of the model. Moreover, the graph convolution operation designed for the temporal graph representation structure achieves effective temporal feature representation and efficient parallel computation while keeping the temporal relationships unchanged, which addresses problems such as the information leakage caused by traditional CNN structures and the low computational efficiency caused by the sequential computation of hidden states in RNN
units, improving the recognition performance and computational efficiency of the model. Finally, a human action recognition model based on the spatial and temporal graph representation network is obtained by combining this temporal structure with the spatial graph convolution structure, which effectively improves the overall recognition accuracy. 4. To address the loss of local detail caused by using a fixed spatio-temporal graph structure for intra-frame and inter-frame human skeletons, a data-driven adaptive dynamic representation method for skeleton graphs is designed, together with an in-depth study of how the graph structure influences the performance of the action recognition model. First, a global dynamic representation structure is constructed based on the distribution characteristics of the overall action sequence data, and a more representative general skeleton graph representation is extracted by adaptive learning, which enhances the general expression of overall behavior characteristics and improves overall recognition accuracy; then, a local dynamic representation structure is built through embedding transformation and relevance representation, so as to compute an optimal, clearly discriminative skeleton graph representation for each individual action sequence, improve the classification of different behavior categories, and remove the limitation of manually designed graph structure parameters; finally, the global and local dynamic representation structures are combined with the spatio-temporal graph convolution structure to adaptively extract human action features, further improving the dynamic learning ability and recognition performance of the model. Experimental results on two standard deep learning datasets show that the proposed method outperforms other state-of-the-art models by 3.4% and 3.1%, which verifies that the proposed method can effectively improve the overall recognition performance of the model. |
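
The fourth contribution combines a fixed skeleton graph with a globally learned graph and a per-sequence local graph. The abstract does not give the exact formulation, so the following is only a minimal NumPy sketch under stated assumptions: the three graphs are simply summed, the local graph is a row-wise softmax over joint-to-joint similarities in an embedded space, and all names (`normalize_adjacency`, `local_graph`, `dynamic_graph_conv`, `global_offset`, `w_theta`, `w_phi`, `w_out`) are hypothetical and not taken from the original work.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of a skeleton graph."""
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def local_graph(x, w_theta, w_phi):
    """Assumed local graph: one (V, V) matrix per sequence.

    x is (T, V, C): frames, joints, channels. Each joint is embedded twice,
    its embeddings are concatenated over time, and joint-to-joint dot
    products are turned into row-stochastic weights with a softmax.
    """
    num_joints = x.shape[1]
    theta = (x @ w_theta).transpose(1, 0, 2).reshape(num_joints, -1)
    phi = (x @ w_phi).transpose(1, 0, 2).reshape(num_joints, -1)
    scores = theta @ phi.T                                  # (V, V)
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


def dynamic_graph_conv(x, adj, global_offset, w_theta, w_phi, w_out):
    """Spatial graph convolution over fixed + global + local graphs (assumed sum)."""
    graph = normalize_adjacency(adj) + global_offset + local_graph(x, w_theta, w_phi)
    # (V, V) @ (T, V, C_in) broadcasts over frames, then channels are mixed.
    return (graph @ x) @ w_out


# Toy usage: a 5-joint chain skeleton, 4 frames, 3 input channels.
rng = np.random.default_rng(0)
T, V, C_in, C_emb, C_out = 4, 5, 3, 2, 6
adj = np.zeros((V, V))
for i in range(V - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
x = rng.normal(size=(T, V, C_in))
out = dynamic_graph_conv(
    x,
    adj,
    np.zeros((V, V)),                   # global graph; would be learned in training
    rng.normal(size=(C_in, C_emb)),
    rng.normal(size=(C_in, C_emb)),
    rng.normal(size=(C_in, C_out)),
)
print(out.shape)  # (4, 5, 6)
```

In a trained model, `global_offset` and the weight matrices would be learned parameters; here they are zero or random placeholders, so the sketch only illustrates the data flow and tensor shapes.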