| Gesture recognition has long been an essential part of human-computer interaction, and progress in the two fields goes hand in hand. Although current artificial intelligence techniques have improved the accuracy of gesture recognition, in practice it is still affected by many factors. Scenes in gesture images are complex and changeable and contain many sources of interference, and the size, position, shape, and color of the hand vary widely across a gesture dataset. These factors hinder the network's extraction of hand features and thus reduce recognition accuracy, so accurately recognizing gestures in complex environments remains an extremely challenging task. To address these problems, this paper carries out research in the following three aspects. (1) Against a complex background, gesture recognition is difficult because the gesture is hard to distinguish from its surroundings. To reduce the influence of complex backgrounds, this paper proposes a gesture recognition network based on the fusion of Histogram of Oriented Gradients (HOG) and Faster Region-based Convolutional Neural Network (Faster R-CNN). First, the gesture region is extracted by object detection. Then, the HS module is introduced to fine-tune the classification loss of the network and improve its gesture recognition accuracy. Finally, a Squeeze-and-Excitation spatial attention mechanism is introduced to improve the accuracy of the predicted bounding box. (2) Gesture recognition is affected by the complexity of the scene and by the shape and size of the hand in the image. To address these problems, this paper proposes a gesture recognition network based on an improved Light-weight, General-purpose, and Mobile-friendly Vision Transformer (MobileViT) to make the network's feature extraction more robust. To enhance the sensitivity of the network to the hand region, an MS-MViT module is designed; its mixed-shape window self-attention mechanism strengthens the network's ability to extract global information from the feature map and reduces the influence of complex scenes on recognition accuracy. A Spatial Pyramid Pooling module is introduced to combine the local and global features of the feature map, enriching its representational capacity so that hands of different sizes are recognized better. (3) This paper designs a deep-network gesture recognition system. The system selects the appropriate network according to the type of image to be recognized and displays the recognition results visually, which gives users more freedom and helps them better understand the results. |
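
The idea behind combining local and global features with Spatial Pyramid Pooling can be illustrated with a minimal sketch. It is not the module used in this paper: it assumes the classic formulation in which a single-channel feature map is max-pooled over grids of several sizes and the results are concatenated. The function name `spatial_pyramid_pool` and the grid sizes in `levels` are hypothetical choices made only for illustration.

```python
import math


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map over grids of several sizes.

    feature_map: list of rows (H x W) of numbers, one channel.
    levels: assumed pyramid grid sizes; level n contributes n * n bins.
    Returns a flat list of length sum(n * n for n in levels),
    regardless of H and W.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # floor/ceil boundaries keep every bin non-empty and
            # make the bins cover the whole map.
            top = math.floor(i * height / n)
            bottom = math.ceil((i + 1) * height / n)
            for j in range(n):
                left = math.floor(j * width / n)
                right = math.ceil((j + 1) * width / n)
                pooled.append(max(
                    feature_map[r][c]
                    for r in range(top, bottom)
                    for c in range(left, right)
                ))
    return pooled
```

In this sketch the level-1 bin summarizes the whole map (global context), while finer levels keep coarse spatial layout (local context). Because the output length depends only on `levels`, inputs in which the hand occupies regions of different sizes map to descriptors of the same size.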