| With the rapid development of portable devices, the Internet, and related industries, the number and coverage of videos have grown explosively. This growth provides more favorable support for intelligent video analysis, of which video action recognition is one of the most important aspects. Video action recognition runs on computers and other computing equipment and uses intelligent analysis models, such as deep learning models, to classify target actions in video. Compared with traditional image processing, video action recognition draws on richer visual representations and more effective spatial and temporal features, which are of greater significance to the action recognition task. With advances in hardware and computing power, deep-learning-based models have achieved remarkable results in action recognition, especially two-stream convolutional networks, which make comprehensive use of spatial and temporal features. However, such methods still have several problems, including insufficient feature expression ability, low comprehensive utilization of spatial and temporal features, and a lack of interaction among feature dimensions. In view of these problems, the main research work of this paper is as follows. First, by analyzing the characteristics of different single-modality information in the process of video extraction, two new types of modal data, C2OF (Combined 2 Optical Flow) and DG (Directed Gradient), are proposed, and the corresponding network access structure is designed. These two kinds of modal information improve the network's representation ability and effectively improve recognition accuracy on the two classic action recognition datasets UCF101 and HMDB51. Second, to further improve the performance of the two main frameworks, this paper proposes GSTIN for spatiotemporal feature fusion. GSTIN designs a spatiotemporal feature fusion module, In BST, which enables the network to obtain interactive temporal and spatial information; based on this module, a multi-branch GSTIN suitable for action recognition is constructed. On UCF101 and HMDB51, the recognition accuracy is 93.8% and 70.6%, respectively. Third, an intelligent monitoring system based on action recognition is designed and developed. The system can label sensitive action types, extract and display multimodal information from real-time frames, load different types of deep learning models as a back-end intelligent analysis support library, and display back-end prediction results in a visual interface. The system realizes intelligent analysis of video actions and thereby assists relevant personnel in early warning and monitoring. |