Behavior recognition refers to the automatic analysis and recognition of human or animal behavior using computer technology. It has a wide range of applications in intelligent surveillance, intelligent interaction, video retrieval, and other fields. The main difficulties in behavior recognition are how to extract effective spatio-temporal features from complex and changing sequential data, how to improve a model's modeling ability, and how to explore different fusion strategies. As research has deepened, behavior recognition methods have evolved from early hand-crafted feature design to today's deep learning-based approaches, and new network architectures and algorithms continue to be proposed. Deep learning-based methods can automatically discover abstract, high-level features from large amounts of data without human intervention; they adapt to data of different types and scales and have strong generalization and representation capabilities. Within a deep learning framework, this article focuses on open problems in behavior recognition and studies two input modalities, RGB video and skeleton data, using both conventional neural networks and emerging graph convolutional networks. The main contributions are as follows:

(1) Contextual information is crucial for video-based human behavior recognition. This article proposes a new action recognition method under a combined LSTM and CNN framework. It uses an improved key-frame extraction technique that combines sparse sampling with long-range temporal modeling to form an efficient training strategy. To improve temporal modeling, a context-guided bidirectional long short-term memory network (Context-Guided BiLSTM) is designed: the high-level semantic information of adjacent key frames guides the network to learn content relevance and fully aggregate spatio-temporal context (a minimal sketch of such a module is given below). In the fusion module, low- and mid-level spatial dynamic features and high-level semantic information are encoded by LSTMs to integrate inter-frame dependencies at different levels, further improving the modeling ability. Experimental results on three benchmark datasets, UCF Sports, UCF11, and JHMDB, show that the proposed method achieves good recognition performance and outperforms most existing action recognition methods.

(2) Graph convolutional networks (GCNs) have shown excellent performance in skeleton-based action recognition in recent years. The key to improving model performance is to effectively encode the topology of the skeleton. To this end, this article proposes an end-to-end joint-guided global graph convolutional network. It first uses an attention mechanism to guide the learning of joint semantic information and constructs a more flexible topology graph by combining global contextual joint information with local joint information. It then uses a CNN-style decoupling method to aggregate individual channels with their corresponding topology graphs without adding a large number of parameters, further improving the spatial expressiveness of the graph convolution (a sketch of this topology construction is given below). To focus on the joints that most affect recognition, a combined spatial-temporal attention mechanism is designed. In addition, the network adopts a multi-stream architecture that combines modal features such as the joint stream, skeleton stream, joint motion stream, and skeleton motion stream, and finally performs a weighted fusion of the prediction scores of the streams (also sketched below). Experimental results on the NTU RGB+D and NTU RGB+D 120 public skeleton datasets show that the proposed method has significant advantages.
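
The following is a minimal, illustrative sketch of a context-guided BiLSTM of the kind described in contribution (1), written in PyTorch. It assumes a CNN backbone has already produced one feature vector per key frame; the class and parameter names (ContextGuidedBiLSTM, feat_dim, hidden_dim), the dimensions, and the use of averaged adjacent key-frame features as the guiding context are illustrative assumptions, not the thesis implementation.

    # Hypothetical sketch: BiLSTM over key-frame CNN features, gated by the
    # semantic context of neighbouring key frames. Not the thesis code.
    import torch
    import torch.nn as nn

    class ContextGuidedBiLSTM(nn.Module):
        def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=12):
            super().__init__()
            # Gate computed from the mean of the adjacent key-frame features.
            self.context_gate = nn.Sequential(nn.Linear(feat_dim, feat_dim),
                                              nn.Sigmoid())
            self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                                  bidirectional=True)
            self.classifier = nn.Linear(2 * hidden_dim, num_classes)

        def forward(self, frame_feats):              # (B, T, feat_dim) key-frame features
            left = torch.roll(frame_feats, 1, dims=1)
            right = torch.roll(frame_feats, -1, dims=1)
            context = 0.5 * (left + right)           # context of adjacent key frames
            gated = frame_feats * self.context_gate(context)
            seq, _ = self.bilstm(gated)              # aggregate spatio-temporal context
            return self.classifier(seq.mean(dim=1))  # temporal pooling -> class scores

    # Usage: batch of 4 clips, 8 key frames each
    scores = ContextGuidedBiLSTM()(torch.randn(4, 8, 2048))
    print(scores.shape)  # torch.Size([4, 12])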
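
For contribution (2), the sketch below shows one plausible way to build the flexible topology described in the abstract: a fixed skeleton adjacency, a learned global graph, and a sample-specific attention graph derived from joint embeddings, with the learned topology decoupled across channel groups. The names and shapes (JointGuidedGraphConv, groups, the 25-joint layout, identity adjacency) are assumptions for illustration, not the proposed network.

    # Hypothetical sketch of a joint-guided graph convolution with per-group topologies.
    import torch
    import torch.nn as nn

    class JointGuidedGraphConv(nn.Module):
        def __init__(self, in_ch, out_ch, adjacency, groups=4):
            super().__init__()                       # in_ch must be divisible by groups
            V = adjacency.size(0)
            self.groups = groups
            self.register_buffer("A_skeleton", adjacency)            # (V, V) physical bones
            self.A_global = nn.Parameter(torch.zeros(groups, V, V))  # decoupled learned graphs
            self.embed_q = nn.Conv2d(in_ch, 4 * groups, 1)           # joint embeddings
            self.embed_k = nn.Conv2d(in_ch, 4 * groups, 1)
            self.proj = nn.Conv2d(in_ch, out_ch, 1)

        def forward(self, x):                        # x: (B, C, T, V)
            B, C, T, V = x.shape
            q = self.embed_q(x).mean(2).view(B, self.groups, -1, V)
            k = self.embed_k(x).mean(2).view(B, self.groups, -1, V)
            A_local = torch.softmax(q.transpose(-2, -1) @ k, dim=-1)  # (B, G, V, V)
            A = self.A_skeleton + self.A_global + A_local             # combined topology
            xg = x.view(B, self.groups, C // self.groups, T, V)
            y = torch.einsum("bgctv,bgvw->bgctw", xg, A).reshape(B, C, T, V)
            return self.proj(y)

    # Usage with a placeholder adjacency for 25 joints
    layer = JointGuidedGraphConv(64, 128, torch.eye(25))
    out = layer(torch.randn(2, 64, 30, 25))          # (2, 128, 30, 25)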
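
Finally, the weighted fusion of per-stream prediction scores at the end of contribution (2) can be sketched as follows; the stream names, class count, and fusion weights are placeholders, not values reported in the thesis.

    # Hypothetical weighted score fusion across the four streams.
    import torch

    def fuse_streams(scores, weights=(0.6, 0.6, 0.4, 0.4)):
        # scores: dict of stream name -> (B, num_classes) softmax scores
        order = ("joint", "skeleton", "joint_motion", "skeleton_motion")
        fused = sum(w * scores[s] for w, s in zip(weights, order))
        return fused.argmax(dim=-1)                  # final predicted class per sample

    # Usage with random placeholder scores (batch of 2, 60 classes as in NTU RGB+D)
    scores = {s: torch.softmax(torch.randn(2, 60), -1)
              for s in ("joint", "skeleton", "joint_motion", "skeleton_motion")}
    print(fuse_streams(scores))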