The sophisticated hardware and rich sensing devices of modern operating rooms provide a strong foundation for developing content analysis methods for surgical scenarios. Based on real-time surgical video and other signals, automatic surgical scene analysis covers tasks such as workflow, instrument, action, target tissue, and surgical action triplet recognition. During surgery, it helps avoid erroneous operations and reduces the risk of complications; after surgery, it enables efficient generation and analysis of operative reports. However, compared with natural scenes, surgical scenes exhibit highly similar tissue surfaces and large variation in how different surgeons perform the same procedure, making it difficult to extract spatio-temporal fusion features that are robust to irrelevant information in the surgical background. Although prior work has approached the tasks of surgical scene parsing from several perspectives and made progress, existing methods still fall short of the precision required for surgical safety. This paper therefore proposes the following methods to address these difficulties and the limitations of existing work.

For surgical workflow recognition, a convolutional recurrent network with a joint step-stage mapping function based on contrastive learning is proposed to address the spatio-temporal inconsistency of surgical videos and the limitations of single-level workflows. The network uses a distraction module to extract richer spatial detail features from the scene, uses the step-stage mapping function and a long short-term memory (LSTM) network in the multi-level branch to realize two-level workflow recognition, and introduces a triplet loss in the contrastive branch to address the large intra-class and small inter-class differences caused by spatio-temporal inconsistencies across frames, ultimately guiding the network to learn spatio-temporal fusion features that are fine-grained and robust to such inconsistencies. Validated on a cataract phacoemulsification surgery dataset, the model converges faster without additional computation and improves workflow recognition accuracy, reaching an F1 score of 94.74%. The method won second place in the 2020 MICCAI Surgical Workflow Recognition Challenge.

For the recognition of surgical instruments, actions, and target tissues, a general recognition method based on a multi-label mutual channel loss is proposed to handle the fine-grained nature of key content in surgical scenes. The multi-label mutual channel loss uses a discriminative module and a diversity module to decouple, at the feature layer, the visual features corresponding to each category. The discriminative module groups the feature channels and applies a discriminative function to produce class probabilities, while the diversity module drives the channels within a group to attend to different regions, locating the local detail features associated with each category. Experiments on a laparoscopic cholecystectomy open competition dataset achieved average accuracies of 82.10%, 51.10%, and 45.50% on instruments, actions, and target tissues, surpassing the SOTA method by 4.6, 3.7, and 4.0 percentage points, respectively, demonstrating the method's ability to extract fine-grained spatial features.

For surgical action triplet recognition, two multi-task learning strategies, a spatial mapping function and a joint loss, are designed to fuse the spatio-temporal features of the triplet's subtasks, and a network structure based on multi-task learning is proposed. By introducing multi-task branches and an LSTM module to capture motion information along the temporal dimension, spatio-temporal fusion triplet features are generated for a given video sequence. Experimental results show that the multi-task network achieves better classification performance, demonstrating that the correlations among the triplet's components benefit the analysis of triplet content in the scene. The method reached 66.58% Top-5 accuracy and 35.8% average precision on the public competition dataset, surpassing the existing SOTA method by 3.1 percentage points, and ranked second.
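The contrastive branch described for workflow recognition can be illustrated by a minimal NumPy sketch of a hinge-style triplet loss on frame embeddings. The function name, margin value, and toy embeddings below are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Illustrative triplet loss (names and margin are assumptions):
    pull a same-phase (positive) frame embedding toward the anchor and
    push a different-phase (negative) embedding away by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance to negative
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: a frame pair from the same phase should yield a lower
# loss than the same triplet with positive/negative roles swapped.
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])   # same phase, small appearance change
n = np.array([0.0, 1.0])   # different phase
assert triplet_loss(a, p, n) < triplet_loss(a, n, p)
```

Minimizing this loss over many sampled triplets is what shrinks intra-class distances and enlarges inter-class distances in the learned feature space.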
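The channel grouping behind the mutual channel loss can also be sketched: assign a fixed number of feature channels per class, score each class via cross-channel max pooling plus global average pooling (discriminative component), and reward channels within a group for attending to different spatial positions (diversity component). All names, the channels-per-class parameter `xi`, and the exact pooling order here are assumptions for illustration only:

```python
import numpy as np

def discriminative_logits(features, num_classes, xi):
    """Group the C = num_classes * xi channels by class, apply
    cross-channel max pooling within each group, then global average
    pooling, yielding one logit per class."""
    C, H, W = features.shape
    assert C == num_classes * xi
    grouped = features.reshape(num_classes, xi, H, W)
    ccmp = grouped.max(axis=1)           # (num_classes, H, W)
    return ccmp.mean(axis=(1, 2))        # (num_classes,)

def diversity_score(features, num_classes, xi):
    """Softmax each channel over spatial positions, then take the
    channel-wise max per position and sum spatially: the score grows
    (toward xi) as channels in a group attend to disjoint regions."""
    C, H, W = features.shape
    grouped = features.reshape(num_classes, xi, H * W)
    soft = np.exp(grouped) / np.exp(grouped).sum(axis=2, keepdims=True)
    return soft.max(axis=1).sum(axis=1).mean()
```

Training would maximize the diversity score (or subtract it from the total loss) while the discriminative logits feed an ordinary classification loss, which is what decouples category-specific channels at the feature layer.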
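For the multi-task triplet network, the joint-loss strategy amounts to a weighted sum of the per-branch losses. The branch names and uniform default weights below are hypothetical placeholders, not values from the thesis:

```python
def joint_loss(losses, weights=None):
    """Joint multi-task loss: weighted sum of per-branch losses
    (e.g. instrument, action, target tissue, and full triplet).
    Branch names and weights are illustrative assumptions."""
    if weights is None:
        weights = {k: 1.0 for k in losses}   # equal weighting by default
    return sum(weights[k] * v for k, v in losses.items())

# Example: four hypothetical branch losses combined with equal weights.
total = joint_loss({"instrument": 0.4, "action": 0.6, "target": 0.5, "triplet": 0.9})
```

Sharing gradients through such a combined objective is how the correlations among the triplet's components can improve each individual branch.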