Research on Spatiotemporal Action Detection Based on Deep Learning

Posted on: 2022-09-15; Degree: Master; Type: Thesis
Country: China; Candidate: Y M Wang; Full Text: PDF
GTID: 2518306551470064; Subject: Computer Science and Technology
Abstract/Summary:
Technologies for automatically detecting specific events in public places are vital to public security and to the development of an intelligent society. Spatiotemporal action detection (STAD), which detects specific behaviors and their locations, is therefore desirable across a broad range of applications. In particular, violence detection (VD), which is mainly used to detect violent incidents such as violence in schools and prisons, has been extensively explored to meet application requirements. Although traditional STAD based on hand-crafted features has matured, its low computational efficiency and weak feature representation have greatly hindered its practical use in daily life and production. Fortunately, the growing application of deep learning in computer vision has substantially advanced STAD. This thesis studies violence detection and spatiotemporal action detection based on deep learning. The work can be summarized as follows:

(1) Violence detection is generally treated as merely determining whether violent behavior occurs in a given video, ignoring where it occurs. The VD approach in this thesis, built on STAD, not only identifies violent behavior but also detects the corresponding spatial and temporal information. Inspired by the two-stage object detection architecture, we design a VD model based on R-CNN. In this model, an actor proposal network generates region proposals for humans, and the spatiotemporal features of violent behavior are obtained by applying three-dimensional convolution and modeling the relevant region features within a certain time range. Extensive experiments verify the effectiveness of the model. Building on this model, we further design a complete VD system that covers the whole pipeline, from data acquisition to saving detection results, and optimize the detection process for better performance in practical settings (online or offline).

(2) Existing STAD methods are usually derived from the two-stage detection architecture widely used in object detection, which comprises a localization stage and a classification stage. However, this architecture inevitably leads to high computational cost and sub-optimal solutions when applied to STAD. This thesis proposes MUB-Detector, a simple and computationally efficient STAD model that is time-sensitive and multi-branched. MUB-Detector is based on a three-dimensional convolutional neural network with strong spatiotemporal modeling ability, and it reduces the STAD task to multiple one-stage "object" detections. The spatial location and action category of the action instances in each frame of the input video clip are then obtained, completing STAD in a single stage. Experimental results on two benchmark datasets (J-HMDB and UCF101-24) show that, compared with methods based on the two-stage detection architecture, the unified STAD framework proposed here effectively improves detection efficiency. In particular, compared with traditional methods that require additional optical flow, which is expensive to compute, MUB-Detector achieves competitive detection accuracy and faster detection speed using only RGB images as input.
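
The idea of treating a clip as several per-frame one-stage detections can be illustrated with a small sketch. This is not the thesis's implementation: the output layout (one list of candidate boxes with class scores per frame), the function name `decode_clip`, the class names, and the score threshold are all assumptions made here for illustration only.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates


def decode_clip(
    branch_outputs: List[List[Tuple[Box, List[float]]]],
    class_names: List[str],
    score_threshold: float = 0.5,  # hypothetical cut-off, not taken from the thesis
) -> List[List[Tuple[Box, str, float]]]:
    """Turn raw per-frame predictions into per-frame (box, label, score) lists.

    branch_outputs[t] holds the candidates for frame t, each as
    (box, per-class scores). This layout is an assumption.
    """
    detections = []
    for frame_preds in branch_outputs:  # one entry per frame of the clip
        frame_dets = []
        for box, scores in frame_preds:
            # Pick the highest-scoring action class for this candidate box.
            best = max(range(len(scores)), key=scores.__getitem__)
            if scores[best] >= score_threshold:
                frame_dets.append((box, class_names[best], scores[best]))
        detections.append(frame_dets)
    return detections


# Toy two-frame clip with made-up numbers and made-up class names.
clip = [
    [((10.0, 20.0, 50.0, 80.0), [0.1, 0.9])],
    [((12.0, 21.0, 52.0, 82.0), [0.2, 0.7]), ((0.0, 0.0, 5.0, 5.0), [0.3, 0.4])],
]
result = decode_clip(clip, ["walk", "punch"])
```

In this toy input each frame keeps one detection labeled `punch`; the second candidate in the second frame is dropped because its best score falls below the assumed threshold. Every frame yields its own boxes and labels in a single pass, with no separate proposal stage.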
Keywords/Search Tags: violence detection, spatiotemporal action detection, deep learning