Gestures, as one of the most common means of everyday communication besides language, play an important role in human-machine interaction. Gesture recognition has become a popular research topic because of its efficient, fast and user-friendly interaction; depending on the type of gesture, it can be divided into static gesture recognition and dynamic gesture recognition. Because dynamic gestures are far more expressive in interaction than static gestures, they have greater research value and broader application prospects. However, the complex temporal and spatial characteristics of dynamic gesture data also make dynamic gesture recognition challenging. Traditional machine learning algorithms struggle to extract accurate dynamic gesture features, while existing deep-learning-based dynamic gesture recognition networks are complex in design, have large numbers of parameters and achieve limited recognition accuracy, which hinders practical application. To address these problems, this thesis proposes a multi-scale spatiotemporal feature fusion network based on the Convolutional Vision Transformer (CvT), which achieves high recognition accuracy with a smaller network model.

First, this thesis introduces the lightweight CvT network, originally used for image classification, into dynamic gesture classification. The CvT model serves as the backbone network with a reduced number of parameters to extract spatial features from the individual gesture images obtained by video framing, and the shallow features at different spatial scales from different stages of the CvT network are fused with the deep features. Second, a multi-time-scale aggregation module is designed that uses 3D convolution to extract the spatio-temporal features of dynamic gestures; combining the CvT network with this module suppresses invalid spatio-temporal features. After the network is constructed, it is further optimized: to make up for the shortcomings of Dropout, the proposed network is combined with Regularized Dropout (R-Drop) to resolve the inconsistency between training and testing behavior introduced by Dropout.

The proposed method is validated through experiments on the large public dynamic gesture dataset Jester and compared with many dynamic gesture recognition methods; in addition, to verify the generalization performance of the model, this thesis builds its own dynamic gesture dataset. The experimental results show that the recognition rate of the proposed method on the Jester dataset is higher than that of existing dynamic gesture recognition methods, while its computational cost is lower than that of state-of-the-art methods, reducing the number of parameters and improving recognition accuracy; the method also performs well on the self-built gesture dataset and can correctly recognize gesture actions in videos. Finally, this thesis uses PyQt5 to build a gesture recognition system that supports local video recognition; the system interface is simple and easy to operate, providing a practical path toward gesture recognition applications.
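
As a rough illustration of the multi-time-scale aggregation idea described above, the following PyTorch sketch applies parallel 3D convolutions with different temporal kernel sizes to per-frame backbone features and fuses them for classification. The branch layout, kernel sizes, fusion-by-concatenation strategy and all names (`MultiTimeScaleAggregation`, `temporal_kernels`) are illustrative assumptions, not the exact design of the thesis.

```python
import torch
import torch.nn as nn

class MultiTimeScaleAggregation(nn.Module):
    """Sketch of a multi-time-scale aggregation module: parallel 3D
    convolutions with different temporal kernel sizes capture
    spatio-temporal patterns at several time scales (assumed structure)."""

    def __init__(self, channels: int, num_classes: int, temporal_kernels=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv3d(channels, channels, kernel_size=(k, 3, 3),
                          padding=(k // 2, 1, 1), bias=False),
                nn.BatchNorm3d(channels),
                nn.ReLU(inplace=True),
            )
            for k in temporal_kernels
        ])
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.classifier = nn.Linear(channels * len(temporal_kernels), num_classes)

    def forward(self, x):
        # x: (batch, channels, frames, height, width) -- per-frame backbone
        # features stacked along the temporal axis.
        feats = [self.pool(branch(x)).flatten(1) for branch in self.branches]
        return self.classifier(torch.cat(feats, dim=1))
```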
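
The R-Drop regularization mentioned above is commonly implemented by forwarding each batch twice (so the two passes see different dropout masks) and penalizing the divergence between the two predicted distributions. A minimal sketch follows; the `model` interface, the weight `alpha` and the loss composition are assumptions for illustration rather than the thesis' exact training setup.

```python
import torch.nn.functional as F

def r_drop_loss(model, x, y, alpha=0.3):
    """R-Drop sketch: two stochastic forward passes plus a symmetric KL
    term that encourages consistent predictions under different dropout masks."""
    logits1 = model(x)  # first pass, dropout mask A
    logits2 = model(x)  # second pass, dropout mask B

    # Standard cross-entropy on both passes.
    ce = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)

    # Symmetric KL divergence between the two softmax distributions.
    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    kl = (F.kl_div(p1, p2, log_target=True, reduction="batchmean")
          + F.kl_div(p2, p1, log_target=True, reduction="batchmean"))

    return ce + alpha * 0.5 * kl
```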