Gestures are the specific movements and postures that people make with their palms and arms. They are among the earliest communication tools and are still widely used today. As an important research direction in computer vision, gesture recognition captures people's hand movements and converts the recognition results into corresponding control instructions. With the development of artificial intelligence, gesture recognition has been widely applied in fields such as human-robot interaction, smart homes, and intelligent driving, making daily life more intelligent and convenient. Because gesture recognition affects how natural and flexible task interaction can be, research on it has practical significance and broad application prospects. This paper studies gesture recognition algorithms based on deep learning techniques, including 2-dimensional convolutional neural networks, vision transformers, attention mechanisms, and knowledge distillation, and applies real-time gesture recognition results to human-robot interaction tasks. The main contents of this paper are as follows:

(1) Gesture recognition based on 2-dimensional convolutional neural networks. Because 3-dimensional convolutional neural networks involve an excessive number of parameters and heavy computation, this paper proposes a gesture recognition method based on 2-dimensional convolutional neural networks, which also avoids the inefficient feature extraction of traditional methods. Taking both recognition accuracy and speed into account, the deep residual network ResNet and the lightweight network MobileNet-v2 are each used for gesture recognition, and their parameters are initialized from models pre-trained on large datasets. With this pre-trained initialization, ResNet-50 improved accuracy by more than 5% and MobileNet-v2 by more than 7% on the Kinetics-400 dataset.

(2) Temporal modeling based on temporal-channel attention. To address the lack of temporal information interaction in 2-dimensional convolutional neural networks, this paper proposes the temporal-channel attention (TCA) module. The TCA module not only models along the temporal dimension but also lets the model focus on task-related features. Embedding the TCA module into backbone networks such as MobileNet-v2 and ResNet yields new networks, TCA-MobileNet-v2 and TCA-ResNet, which extract more discriminative gesture-related features and suppress interference from redundant information. Compared with the MobileNet-v2 and ResNet-50 backbones, the TCA module brings a significant performance improvement: TCA-MobileNet-v2 and TCA-ResNet-50 achieved recognition accuracies of 95.1% and 97.0% on the Jester dataset, respectively.

(3) A lightweight hybrid model, HybridNet, based on TCA-MobileNet-v2 and the vision transformer. In practical applications, a gesture recognition model must remain lightweight while still maintaining high accuracy. HybridNet combines the local modeling of convolutional neural networks, the global modeling of the vision transformer, and the temporal modeling of the temporal-channel attention module. It therefore retains the advantage of being lightweight while achieving strong performance, balancing accuracy and speed. Compared with TCA-MobileNet-v2, HybridNet improved accuracy by 0.91% and 2.33% on the Jester and EgoGesture datasets, respectively.

(4) A knowledge distillation approach further improves the performance of the gesture recognition model HybridNet, and the distilled model is applied to real-time gesture recognition and human-robot interaction tasks. After knowledge distillation, HybridNet achieved 96.3% recognition accuracy on the Jester dataset and 93.9% on the EgoGesture dataset. The distilled HybridNet serves as the real-time gesture recognition model; its recognition results are matched to the corresponding tasks and fed into the Assistive Gym framework as control instructions, which direct assistive robots to perform the relevant tasks and thus realize human-robot interaction.
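Item (1) motivates the 2-dimensional approach by the parameter and computation cost of 3-dimensional convolution. As a rough illustration (not a figure reported in this paper), the Python sketch below counts weights and multiply-accumulate operations for a single convolution layer. The channel width of 256, the 8-frame clip, and the 14x14 feature map are arbitrary assumptions, and bias terms are ignored.

```python
from math import prod

def conv_params(c_in, c_out, kernel):
    """Weights in a standard (non-grouped) convolution, bias ignored."""
    return c_in * c_out * prod(kernel)

def conv_macs(c_in, c_out, kernel, output_positions):
    """Multiply-accumulates, assuming stride 1 and 'same' padding."""
    return conv_params(c_in, c_out, kernel) * output_positions

c_in = c_out = 256                 # assumed channel width
frames, height, width = 8, 14, 14  # assumed clip length and feature-map size
positions = frames * height * width

# A 2D 3x3 kernel applied to each frame vs. a 3D 3x3x3 kernel over the clip
p2d = conv_params(c_in, c_out, (3, 3))
p3d = conv_params(c_in, c_out, (3, 3, 3))
m2d = conv_macs(c_in, c_out, (3, 3), positions)
m3d = conv_macs(c_in, c_out, (3, 3, 3), positions)

print(p2d, p3d, p3d / p2d)  # 589824 1769472 3.0
print(m3d / m2d)            # 3.0
```

In this simple count the extra temporal kernel dimension multiplies both parameters and computation by 3 for every layer; real 3D networks differ in depth and kernel shapes, so the factor is only indicative.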
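The abstract does not specify the internals of the TCA module described in item (2). The NumPy sketch below shows one plausible way to combine temporal and channel attention, assuming a squeeze-and-excitation-style design: spatially pooled per-frame channel descriptors are mixed across neighboring frames by a depthwise temporal convolution (kernel size 3), then passed through a channel bottleneck and a sigmoid to produce per-frame channel weights. The function name `tca_sketch`, the kernel size, and the reduction ratio are hypothetical choices; this is not the paper's TCA module.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tca_sketch(x, w_temporal, w_reduce, w_expand):
    """Hypothetical temporal-channel attention (not the paper's exact TCA).

    x:          features from a 2D backbone, shape (N, T, C, H, W)
    w_temporal: depthwise temporal kernel, shape (C, 3)
    w_reduce:   channel-reduction weights, shape (C, C // r)
    w_expand:   channel-expansion weights, shape (C // r, C)
    """
    # 1. Squeeze: global average pool over space -> channel descriptor per frame (N, T, C)
    s = x.mean(axis=(3, 4))
    # 2. Temporal interaction: depthwise kernel-3 conv along T with zero padding,
    #    so each frame's descriptor mixes in its previous and next frames
    padded = np.pad(s, ((0, 0), (1, 1), (0, 0)))
    temporal = (padded[:, :-2, :] * w_temporal[:, 0]
                + padded[:, 1:-1, :] * w_temporal[:, 1]
                + padded[:, 2:, :] * w_temporal[:, 2])
    # 3. Excitation: bottleneck MLP over channels, sigmoid gives weights in (0, 1)
    hidden = np.maximum(temporal @ w_reduce, 0.0)
    attn = sigmoid(hidden @ w_expand)  # (N, T, C)
    # 4. Reweight every channel of every frame; spatial size is unchanged
    return x * attn[:, :, :, None, None], attn

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 16, 7, 7))  # 2 clips, 8 frames, 16 channels, 7x7 maps
w_t = 0.1 * rng.standard_normal((16, 3))
w_r = 0.1 * rng.standard_normal((16, 4))   # reduction ratio r = 4
w_e = 0.1 * rng.standard_normal((4, 16))
y, attn = tca_sketch(x, w_t, w_r, w_e)
print(y.shape, attn.shape)  # (2, 8, 16, 7, 7) (2, 8, 16)
```

Because the weights depend on neighboring frames, the same channel can be emphasized in one frame and suppressed in another, which is the kind of temporal interaction a plain per-frame 2D network lacks.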
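Item (4) does not state which distillation objective is used. A common choice, assumed here, is the formulation of Hinton, Vinyals and Dean (2015): a temperature-softened KL divergence between teacher and student predictions, scaled by T^2, plus ordinary cross-entropy on the ground-truth label. The pure-Python sketch below computes it for a single clip; the temperature, the weight `alpha`, and the example logits are illustrative assumptions.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]

def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.7):
    """Hinton-style distillation loss for one sample (assumed objective)."""
    log_pt = log_softmax(teacher_logits, temperature)
    log_ps = log_softmax(student_logits, temperature)
    # KL(teacher || student) between the softened distributions
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_pt, log_ps))
    soft = temperature ** 2 * kl  # T^2 keeps soft-target gradients on a comparable scale
    # Standard cross-entropy with the ground-truth gesture label (T = 1)
    hard = -log_softmax(student_logits)[label]
    return alpha * soft + (1.0 - alpha) * hard

teacher = [3.0, 0.5, -1.0]  # hypothetical teacher logits for 3 gesture classes
student = [2.0, 1.0, -0.5]  # hypothetical student logits
print(kd_loss(student, teacher, label=0))
```

The soft term lets the student learn from the teacher's relative confidence across similar gestures, while the hard term keeps it anchored to the true label.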
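Item (4) also describes matching real-time recognition results to tasks before they are sent on as control instructions. The sketch below shows one simple way such matching could work, assuming a lookup table plus a debouncing rule: a command is emitted only after the same mapped gesture wins several consecutive high-confidence predictions, and it is not repeated while the gesture is held. The gesture labels follow the style of Jester class names, while the command names, the 0.8 threshold, and the 3-prediction window are hypothetical. Passing the commands into Assistive Gym is not shown, since the abstract does not describe how instructions map onto that framework's actions.

```python
from collections import deque

# Hypothetical task table: gesture label -> robot control instruction
GESTURE_TO_COMMAND = {
    "Swiping Left": "MOVE_LEFT",
    "Swiping Right": "MOVE_RIGHT",
    "Thumb Up": "CONFIRM",
    "Stop Sign": "STOP",
}

class CommandMatcher:
    """Turns a stream of (label, confidence) predictions into control instructions."""

    def __init__(self, threshold=0.8, window=3):
        self.threshold = threshold
        self.recent = deque(maxlen=window)
        self.last_command = None

    def update(self, label, confidence):
        # Low-confidence or unmapped predictions reset the streak and re-arm the matcher
        if confidence < self.threshold or label not in GESTURE_TO_COMMAND:
            self.recent.clear()
            self.last_command = None
            return None
        self.recent.append(label)
        # Emit only when the whole window agrees on one gesture
        if len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1:
            command = GESTURE_TO_COMMAND[label]
            if command != self.last_command:  # do not repeat while the gesture is held
                self.last_command = command
                return command
        return None

stream = [("Swiping Left", 0.95), ("Swiping Left", 0.91), ("Swiping Left", 0.97),
          ("Swiping Left", 0.96), ("No gesture", 0.99),
          ("Stop Sign", 0.90), ("Stop Sign", 0.88), ("Stop Sign", 0.93)]
matcher = CommandMatcher(threshold=0.8, window=3)
emitted = []
for label, conf in stream:
    command = matcher.update(label, conf)
    if command is not None:
        emitted.append(command)  # a real system would forward this to the robot
print(emitted)  # ['MOVE_LEFT', 'STOP']
```

Requiring agreement across several predictions trades a small delay for robustness against single-frame misclassifications, which matters more when each instruction moves a physical robot.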