
Research On Video Temporal Action Localization Based On Deep Learning

Posted on: 2021-03-11    Degree: Master    Type: Thesis
Country: China    Candidate: Q Wang    Full Text: PDF
GTID: 2428330647467254    Subject: Intelligent perception and control
Abstract/Summary:
With the development of computer and network technologies, the amount of video in everyday life is growing explosively, and video analysis has become a hot topic in computer vision. Temporal action localization is an important branch of video analysis: given a long, untrimmed video containing multiple action instances and complex background, the task is first to identify the class of each action instance and then to locate its start and end times. For these two tasks, this thesis proposes novel methods; the specific work is as follows:

(1) For action recognition, the two mainstream models are the two-stream convolutional neural network and the 3D convolutional neural network. The two-stream network is slow and does not correlate spatial and temporal features sufficiently, while the Convolutional 3D (C3D) network, a typical 3D convolutional network, suffers from redundant input data, insufficient extraction of motion information and, structurally, insufficient fusion of spatio-temporal features. This thesis therefore first improves the input of the C3D network, adopting a sparse sampling strategy that preserves the relevant information at lower cost and using both RGB images and optical flow images as input. Experiments show that sparse sampling improves recognition efficiency, and that introducing optical flow as an input makes reasonable use of the motion information in the video and improves recognition accuracy. On this basis, a spatio-temporal feature fusion action recognition model based on sparse sampling is proposed: the sparsely sampled RGB and optical flow images are fed into a two-stream convolutional neural network to extract the spatial and temporal features of the video, respectively; the two streams are fused to obtain mid-level spatio-temporal fusion features that effectively reflect spatio-temporal correlation; and these fusion features are finally fed into the C3D network to identify the action class. Experiments on the UCF101 and HMDB51 datasets show that the framework effectively improves action recognition accuracy.
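As a concrete illustration of the sparse sampling strategy described above, the following minimal Python sketch divides a video's frame range into a fixed number of equal segments and draws one short snippet from each, so the whole video is covered at low cost. This is an assumed, simplified rendering of TSN-style sparse sampling, not the thesis code; the function name sparse_sample_indices and its parameters are illustrative.

    import random

    def sparse_sample_indices(num_frames, num_segments=8, snippet_len=5, training=True):
        """Pick one short snippet of frames from each of `num_segments` equal
        temporal segments (illustrative TSN-style sparse sampling sketch)."""
        seg_len = num_frames / float(num_segments)
        snippets = []
        for s in range(num_segments):
            start = int(s * seg_len)
            end = int((s + 1) * seg_len)
            last_valid = max(start, end - snippet_len)      # keep the snippet inside its segment
            if training:
                offset = random.randint(start, last_valid)  # random snippet while training
            else:
                offset = (start + last_valid) // 2          # centre snippet at test time
            snippets.append([min(i, num_frames - 1) for i in range(offset, offset + snippet_len)])
        return snippets  # the same indices are reused to load the corresponding optical flow stacks

    # Example: 8 snippets of 5 frames each from a 400-frame video
    print(sparse_sample_indices(400, training=False))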
(2) For temporal action localization, existing candidate-region selection algorithms and action recognition networks do not extract spatio-temporal features sufficiently, and the boundaries produced by start/end localization algorithms often deviate from the true boundaries. This thesis proposes two models to improve the accuracy of temporal action localization. First, to address the limited localization accuracy of the C3D network, a multi-stage segment temporal action localization model based on temporal segment networks is proposed. The model first uses multi-scale segmentation to generate video segments, then passes them through a proposal network, a classification network and a localization network, each built on the more accurate temporal segment network, and finally applies non-maximum suppression to complete the temporal action localization. Second, to improve the accuracy of the candidate-region selection and start/end boundary localization algorithms, a spatio-temporal feature fusion temporal action localization model is proposed. In this model, the spatio-temporal feature fusion action recognition model based on sparse sampling serves as the candidate-region extraction network, making full use of the spatial and temporal features of each video segment to decide whether it is a candidate region. The candidate regions are then fed into a convolutional-de-convolutional network to classify the actions at frame-level granularity. Finally, an action state detection network is trained to refine the candidate regions and obtain more accurate start and end times. Experiments on the THUMOS'14 dataset show that both models outperform existing methods.
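The non-maximum suppression step used to finish the localization can be illustrated by the following minimal sketch: temporal proposals are sorted by confidence, and a proposal is kept only if its temporal IoU with every already-kept proposal stays below a threshold. This is a generic greedy temporal NMS over assumed (start, end, score) tuples, not the thesis implementation.

    def temporal_iou(a, b):
        """Temporal intersection-over-union of two (start, end) intervals."""
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    def temporal_nms(proposals, iou_threshold=0.5):
        """Greedy NMS over (start, end, score) action proposals: higher-scoring
        proposals suppress heavily overlapping lower-scoring ones."""
        kept = []
        for p in sorted(proposals, key=lambda x: x[2], reverse=True):
            if all(temporal_iou(p[:2], k[:2]) < iou_threshold for k in kept):
                kept.append(p)
        return kept

    # Example: two overlapping detections of one action and one distinct detection
    detections = [(10.0, 18.0, 0.92), (11.0, 19.0, 0.85), (40.0, 46.0, 0.70)]
    print(temporal_nms(detections))  # -> [(10.0, 18.0, 0.92), (40.0, 46.0, 0.70)]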
Keywords/Search Tags: video analysis, action recognition, temporal action localization, spatio-temporal features