As public attention to social security grows, massive video datasets have become an inevitable by-product of daily life. The wide coverage of surveillance video and the ubiquity of mobile phone cameras have made human abnormal action recognition increasingly important in the era of big data. In hospital care, traffic supervision, smart homes, and intelligent security for large public places, good use of video abnormal action recognition algorithms can greatly improve supervision efficiency. As demand grows, human abnormal action recognition plays an increasingly important role in society, and its combination with deep learning permeates many aspects of society. Deep learning based human abnormal action recognition can process and learn directly from video data, which greatly enhances the generalization ability of the model; it has largely replaced traditional hand-crafted feature extraction and become a hot research topic in the field. The two-stream method and the 3D convolutional network are classical deep learning models, but each has its own shortcomings. In response, an inflated 3D network (I3D) that combines the advantages of both has emerged in recent years. It draws on the simple model architecture of the two-stream method while retaining the 3D model's strong ability to process complex data, and it is on this backbone network that this paper conducts a series of improvement studies. Once deeper networks can be trained, attention shifts to how to handle the temporal information of the video. In
many cases, studies of action recognition focus mainly on action patterns, but in real life rhythm is also an important factor in judging behavior. For example, the key to distinguishing running from walking is precisely the difference in their visual rhythms. The traditional way to handle visual rhythm is to build a frame pyramid, but this approach requires a dedicated backbone network for each level, which is computationally expensive. In this paper, we simulate inputs at different frame rates by directly applying a single-step, hyperparameter-controlled sampling during temporal semantic adjustment, in order to capture visual rhythm information. Unlike the traditional frame pyramid, this direct capture method depends strongly on long-range contextual information and on how well global information is grasped; information lost while context is propagated has a large effect on the result, so the system must also pay more attention to how contextual information is processed. We therefore introduce a non-local neural network module to replace the traditional local LSTM mechanism, in order to enhance the network's ability to grasp global information and to model long-range dependencies. Finally, we test the network on a large mainstream dataset, where our improved model shows gains over other cutting-edge algorithms. In this paper, we use the I3D model as the backbone network and make a series of improvements to address the problems above. The main contributions of this paper are as follows: 1. To address the network degradation caused by vanishing gradients when training deep networks, this paper introduces residual connections into the original I3D network design. 2. For the processing of visual rhythm information, this
paper directly extracts the output features of different levels of the backbone network, and then simulates inputs at different frame rates by applying a set of hyperparameter-controlled downsampling operations to the temporal features in subsequent processing, which yields both a simple model and high recognition accuracy. 3. To address the problem of information being drowned out during propagation over long temporal sequences, this paper inserts a non-local neural network module based on a self-attention mechanism in the middle of the network to enhance its grasp of global image features. 4. We experimentally compare the effects of top-down, bottom-up, cascade, and parallel feature fusion methods on recognition results, and find that the parallel method works best. |
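The idea of simulating different frame rates by downsampling temporal features can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation. The function names, the per-level rates, the temporal average pooling, and the reading of "parallel" fusion as concatenation of independently processed branches are all assumptions made for illustration.

```python
from typing import List, Sequence

Frame = List[float]  # one per-frame feature vector (hypothetical layout)


def temporal_downsample(features: Sequence[Frame], rate: int) -> List[Frame]:
    """Keep every `rate`-th frame, imitating an input sampled at a lower frame rate."""
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    return [list(frame) for frame in features[::rate]]


def temporal_average(features: Sequence[Frame]) -> Frame:
    """Average the per-frame vectors over time into a single vector."""
    if not features:
        raise ValueError("features must contain at least one frame")
    count = len(features)
    dim = len(features[0])
    return [sum(frame[d] for frame in features) / count for d in range(dim)]


def parallel_fuse(levels: Sequence[Sequence[Frame]], rates: Sequence[int]) -> Frame:
    """Assumed 'parallel' fusion: each level is downsampled with its own rate,
    pooled over time independently, and the results are concatenated."""
    if len(levels) != len(rates):
        raise ValueError("need exactly one rate per feature level")
    fused: Frame = []
    for feats, rate in zip(levels, rates):
        fused.extend(temporal_average(temporal_downsample(feats, rate)))
    return fused


# Toy features standing in for two backbone levels (values are made up).
shallow = [[float(t)] for t in range(8)]    # 8 frames, 1 channel
deep = [[float(t), 1.0] for t in range(8)]  # 8 frames, 2 channels

# The rates are the hyperparameters: 1 keeps every frame, 4 keeps frames 0 and 4.
fused = parallel_fuse([shallow, deep], rates=[1, 4])
print(fused)  # [3.5, 2.0, 1.0]
```

Because every level reuses features from a single backbone pass, no extra backbone is needed per frame rate, which is the cost the frame pyramid incurs.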