With the continuous development of artificial intelligence and human-computer interaction, gesture recognition technology has been applied more and more widely. Given the constraints imposed by the many usage scenarios and needs of actual production and daily life, designing vision-based dynamic gesture recognition methods built on deep learning has gradually become a key research direction. However, analysis of the current research status from different aspects shows that many problems in this field remain to be solved. From a microscopic perspective, most studies use the classic 3D convolution module as the basic structure of the deep learning model; this realizes the extraction of spatiotemporal features but often overlooks factors such as the large scale of the model and the difficulty of effectively modeling time-domain information. From a macro perspective, balancing the operational efficiency of the model with its ability to integrate global correlation information is a problem worthy of attention and research. In addition, single-modal data cannot provide sufficient practical information in specific application environments, so a better multi-modal fusion method can improve the overall recognition performance to a certain extent. This paper studied visual dynamic gesture recognition methods based on deep learning and achieved the following results:

(1) For recognition methods based on convolutional neural networks, to address the problems of large network size, difficulty in modeling time-series information, and insufficient ability to extract useful features, this paper proposed a spatiotemporal convolution structure with multi-scale fusion in the time domain and a channel attention module that synchronizes global information, and used them to build a deep learning model. The feature extraction process combined spatial convolution with temporal residual-like convolution to extract features over multi-scale receptive fields, compensating for the lack of contextual information. The model then encoded global and saliency information and used only a few training parameters to build global dependencies between channels, improving the feature representation. Ablation experiments show that the constructed network structure improves the overall recognition accuracy by 4.45 percentage points while substantially reducing the parameter scale. Comparison with various state-of-the-art methods further demonstrates the effectiveness of the proposed model.

(2) Within current deep learning network frameworks, to address the problems of poor parallel computing ability and insufficient integration of global correlation information, a spatiotemporal self-attention module was designed and combined with a convolutional neural network to construct a composite deep learning architecture that uses a grouping mechanism. First, the model used self-attention to process high-level spatiotemporal features, modeling global correlation information through matrix relationships among high-dimensional tensors while preserving the parallel computing capability of the model. Second, the introduced grouping mechanism reduces the complexity of the model to a certain extent while obtaining rich multivariate information. Performance analysis and ablation experiments verified that the spatiotemporal self-attention module effectively improves recognition performance, and comparisons among various recognition methods fully demonstrate the advantages of the constructed composite network framework.

(3) To address problems such as difficulty in improving accuracy and poor model robustness, caused by the insufficient practical information provided by single-modal data, a spatiotemporal feature mutual-information module was introduced to construct a multi-modal fusion method. It realized information flow between modalities at several vital feature extraction nodes and, at the same time, suppressed the redundant information contained in the spatiotemporal features. Ablation and comparative experiments show that the fusion method yields an improvement of more than 2% in recognition accuracy and achieves a better dynamic gesture recognition effect.

In summary, this paper analyzed deep learning models for dynamic gesture recognition in detail along three research directions: micro-level structure composition, macro-level framework construction, and multi-modal data information flow. For each of the identified problems, a corresponding network structure was constructed and verified by experiments, achieving the overall research goal.
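The channel attention idea in contribution (1), building global dependencies between channels from encoded global information with few parameters, can be illustrated with a minimal squeeze-and-excitation-style sketch. The abstract does not give the module's exact structure, so the pooling choice, bottleneck ratio `r`, and weight shapes below are assumptions for illustration, not the thesis's actual design:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def channel_attention(x, w1, w2):
    """SE-style channel attention over a video feature map.

    x  : (C, T, H, W) spatiotemporal feature tensor
    w1 : (C//r, C) squeeze weights; w2 : (C, C//r) excite weights
    Returns x rescaled per channel by globally derived weights.
    """
    # Encode global information: average over all spatiotemporal positions
    z = x.mean(axis=(1, 2, 3))               # (C,)
    # Few parameters: a bottleneck of two small linear maps
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0))  # (C,) channel weights in (0, 1)
    return x * s[:, None, None, None]

rng = np.random.default_rng(0)
C, r = 16, 4
x = rng.standard_normal((C, 4, 8, 8))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = channel_attention(x, w1, w2)
print(y.shape)  # (16, 4, 8, 8)
```

The bottleneck ratio `r` is what keeps the parameter count low: the two linear maps use 2·C²/r weights instead of the C² a full channel-mixing layer would need.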
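The grouped spatiotemporal self-attention of contribution (2) can likewise be sketched. The abstract does not specify the grouping scheme; a multi-head-style channel grouping is one plausible reading, and the identity Q/K/V projections below stand in for the learned weights a real module would have:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_self_attention(x, groups):
    """Self-attention over flattened spatiotemporal positions, applied
    independently to each channel group.

    x : (N, C) with N = T*H*W flattened positions and C channels.
    Each group builds its own (N, N) correlation map, giving several
    diverse views of the global structure from small per-group subspaces.
    """
    outs = []
    for g in np.split(x, groups, axis=1):     # (N, C/groups) each
        d = g.shape[1]
        # identity projections stand in for learned Q/K/V weights
        attn = softmax(g @ g.T / np.sqrt(d))  # (N, N) global correlations
        outs.append(attn @ g)
    return np.concatenate(outs, axis=1)       # (N, C)

rng = np.random.default_rng(1)
T, H, W, C = 2, 4, 4, 8
x = rng.standard_normal((T * H * W, C))
y = grouped_self_attention(x, groups=2)
print(y.shape)  # (32, 8)
```

Because every position attends to every other position through one matrix product, the whole operation is a batch of dense matrix multiplications, which is what gives the module the parallel-computing friendliness the abstract emphasizes.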
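For contribution (3), one simple way to picture "information flow between modalities at a feature node while suppressing redundancy" is mutual gating between two aligned feature streams. This is a hypothetical illustration only; the thesis's mutual-information module is not specified in the abstract, and the gating form, modality names, and tensor shapes here are assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_modal_exchange(f_rgb, f_depth):
    """Hypothetical exchange at one fusion node: each modality derives a
    (0, 1) gate from the other's features, so information flows across
    branches while weakly supported activations are damped."""
    gated_rgb = f_rgb * sigmoid(f_depth)
    gated_depth = f_depth * sigmoid(f_rgb)
    return gated_rgb, gated_depth

rng = np.random.default_rng(2)
f_rgb = rng.standard_normal((16, 4, 8, 8))    # (C, T, H, W) RGB features
f_depth = rng.standard_normal((16, 4, 8, 8))  # aligned depth features
out_rgb, out_depth = cross_modal_exchange(f_rgb, f_depth)
print(out_rgb.shape)  # (16, 4, 8, 8)
```

Inserting such an exchange at a few key nodes, rather than everywhere, matches the abstract's description of fusing modalities only at vital feature extraction stages.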