Vascular disease has a very high fatality rate and seriously threatens patients' lives and health. Interventional therapy is a minimally invasive, high-technology treatment with the advantages of less trauma, faster postoperative recovery, broader indications, and fewer complications. Recognizing the interventionalist's hand gestures is the basis for analyzing the surgical process and prompting surgical progress. When an interventionalist manipulates the guidewire during vascular interventional surgery, the frequency and risk level of surgical actions differ across the stages of the operation. Accurately identifying the doctor's operating skill patterns is therefore an important part of intraoperative situational awareness and risk warning, and also a fundamental part of constructing a surgical environment and knowledge understanding model. At the same time, junior interventionalists must spend considerable effort learning from experienced experts; quantifying the operator's behavior, experience, and intuition during surgery helps train less experienced doctors and is a basic component of an operation evaluation and teaching system for interventionalists.

At present, most gesture recognition research in the medical field is based on surface electromyography (sEMG) signals. However, acquiring sEMG signals requires doctors to wear expensive sensors, and the cumbersome wearing of sensors interferes with the flexibility of doctors' hand movements. This paper therefore adopts a computer-vision-based gesture recognition method: surgical video is collected with a camera to build an RGB-modality dataset of doctors' hand movements in interventional surgery. We fully investigate the application of convolutional neural networks to gesture recognition and study gesture recognition algorithms based on 3D convolutional neural networks and on 2D convolutional neural networks with a hybrid attention mechanism:

(1) This paper proposes a two-stage gesture recognition architecture consisting of a detector and a classifier, which enhances the robustness of gesture recognition in practical scenarios, improves recognition efficiency, and reduces system power consumption. The first-stage network is a detector that detects whether a gesture is present; it is a lightweight 3D CNN, which ensures the real-time performance of the system. The second-stage network is a classifier that classifies the detected gestures; it is a deep 3D CNN, which ensures the accuracy of gesture recognition. At run time, a sliding window moves along the input video frame sequence in real time, and the frames are fed to the detector through a detector queue. To handle deviations of gestures in the input frames, the detection results are cached and filtered by a post-processing module, which raises the confidence of the detection results. When the detector detects that a gesture has occurred, the classifier is awakened to classify the gesture and produce a result. Finally, a single-activation module ensures that each individual gesture is recognized only once, as sketched below.
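The following is a minimal sketch of this sliding-window pipeline, assuming a PyTorch implementation. The window lengths, detection threshold, and the detector/classifier modules are illustrative assumptions, not the concrete design used in this paper.

```python
# Minimal sketch of the two-stage sliding-window pipeline (illustrative only).
# Assumptions: the detector outputs a single gesture/no-gesture logit, the
# classifier outputs per-class logits, and both accept clips of shape (N, C, T, H, W).
from collections import deque

import torch
import torch.nn as nn


class TwoStageGestureRecognizer:
    def __init__(self, detector: nn.Module, classifier: nn.Module,
                 detector_window: int = 8, classifier_window: int = 32,
                 detect_threshold: float = 0.8):
        self.detector = detector.eval()          # lightweight 3D CNN: gesture / no-gesture
        self.classifier = classifier.eval()      # deep 3D CNN: gesture class
        self.detector_queue = deque(maxlen=detector_window)
        self.classifier_queue = deque(maxlen=classifier_window)
        self.score_buffer = deque(maxlen=4)      # post-processing filter over detector scores
        self.detect_threshold = detect_threshold
        self.active = False                      # single-activation flag

    @torch.no_grad()
    def step(self, frame: torch.Tensor):
        """Consume one RGB frame (C, H, W); return a class id once per detected gesture."""
        self.detector_queue.append(frame)
        self.classifier_queue.append(frame)
        if len(self.detector_queue) < self.detector_queue.maxlen:
            return None

        # The detector runs on every step over the short sliding window.
        clip = torch.stack(list(self.detector_queue), dim=1).unsqueeze(0)  # (1, C, T, H, W)
        gesture_prob = torch.sigmoid(self.detector(clip)).item()

        # Post-processing: average recent detector scores to suppress jitter.
        self.score_buffer.append(gesture_prob)
        smoothed = sum(self.score_buffer) / len(self.score_buffer)

        if smoothed >= self.detect_threshold and not self.active:
            self.active = True                   # wake the classifier once per gesture
            long_clip = torch.stack(list(self.classifier_queue), dim=1).unsqueeze(0)
            return self.classifier(long_clip).argmax(dim=1).item()
        if smoothed < self.detect_threshold:
            self.active = False                  # gesture ended; allow the next activation
        return None
```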
(2) This paper proposes a pluggable hybrid attention module that can be embedded into 2D CNN networks. It compensates for the inability of traditional 2D CNNs to represent temporal information, greatly improves the accuracy of gesture recognition tasks, and has lower model complexity than 3D CNNs, which is more conducive to model deployment in real-world scenarios. The SME hybrid attention module designed in this paper is composed of spatiotemporal attention (STA), motion attention (MA), and efficient channel attention (ECA), which extract spatiotemporal, motion, and channel information, respectively; these three kinds of information are complementary and vital for video gesture recognition. The STA module adopts a single-channel 3D convolution to characterize spatiotemporal features. The MA module computes feature-level temporal differences, which are then used to excite motion-sensitive channels. The ECA module adaptively recalibrates channel-wise feature responses by explicitly modeling the interdependencies between channels. The SME attention module is inserted into four backbone networks (ResNet-50, MobileNet V2, ShuffleNet V2, and BNInception) and evaluated on four datasets (nvGesture, Jester, EgoGesture, and the self-built interventional surgery gesture dataset), where it consistently outperforms its 3D CNN and 2D CNN counterparts. A hypothetical sketch of such a module is given at the end of this summary.

Finally, comparative experiments show that the gesture recognition model based on the 3D convolutional neural network and the gesture recognition model based on the 2D convolutional neural network with the hybrid attention mechanism proposed in this paper achieve higher overall classification accuracy than comparable models, and the lightweight architecture is suitable for practical applications.
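To make the structure described in contribution (2) concrete, the following is a hypothetical sketch of an SME-style block combining STA, MA, and ECA branches for a 2D CNN backbone. All layer choices, kernel sizes, and the way the three branches are fused are assumptions made for illustration and may differ from the module actually designed in this paper.

```python
# Hypothetical sketch of an SME-style hybrid attention block (STA + MA + ECA).
# The input is assumed to be 2D CNN features of shape (N*T, C, H, W), where T is
# the number of sampled frames per clip; all design details below are illustrative.
import torch
import torch.nn as nn


class SMEAttention(nn.Module):
    def __init__(self, channels: int, num_segments: int, eca_kernel: int = 3):
        super().__init__()
        self.t = num_segments
        # STA: channels are averaged to a single map, then a 3D convolution
        # produces a per-position spatiotemporal attention weight.
        self.sta_conv = nn.Conv3d(1, 1, kernel_size=3, padding=1)
        # MA: a 1x1 projection whose temporal differences drive channel excitation.
        self.ma_fc = nn.Conv2d(channels, channels, kernel_size=1)
        # ECA: 1D convolution over globally pooled channel descriptors.
        self.eca_conv = nn.Conv1d(1, 1, kernel_size=eca_kernel, padding=eca_kernel // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        n = nt // self.t
        video = x.reshape(n, self.t, c, h, w)

        # STA: single-channel 3D convolution over (T, H, W).
        sta_in = video.mean(dim=2, keepdim=True).permute(0, 2, 1, 3, 4)   # (N, 1, T, H, W)
        sta = self.sigmoid(self.sta_conv(sta_in)).permute(0, 2, 1, 3, 4)  # (N, T, 1, H, W)

        # MA: feature-level temporal differences -> motion-sensitive channel weights.
        proj = self.ma_fc(x).reshape(n, self.t, c, h, w)
        diff = proj[:, 1:] - proj[:, :-1]                                  # (N, T-1, C, H, W)
        diff = torch.cat([diff, diff.new_zeros(n, 1, c, h, w)], dim=1)     # pad last step
        ma = self.sigmoid(diff.mean(dim=[3, 4], keepdim=True))             # (N, T, C, 1, 1)

        # ECA: channel attention from global average pooling + 1D convolution.
        pooled = video.mean(dim=[1, 3, 4])                                  # (N, C)
        eca = self.sigmoid(self.eca_conv(pooled.unsqueeze(1))).reshape(n, 1, c, 1, 1)

        # Fuse the three complementary cues (one simple choice among many).
        out = video * sta + video * ma + video * eca
        return out.reshape(nt, c, h, w)
```

In practice, such a block would typically be inserted after selected convolutional stages of the backbone (for example, after each residual block of ResNet-50), keeping the input and output shapes identical so the surrounding 2D CNN is unchanged.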