
Temporal Action Detection Based On Deep Learning

Posted on: 2022-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: X Tian
GTID: 2568306488980789
Subject: Engineering
Abstract/Summary:
With the growth of online video and the rise of intelligent surveillance, the demand for video analysis has increased, and temporal action detection in video has received growing attention. This thesis focuses on the construction of the action detection network framework and on improving the temporal localization of actions. Temporal action detection methods are studied from the following two perspectives.

(1) An action detection method based on three-dimensional convolution and temporal proposals. This method exploits the fact that three-dimensional convolution over video produces features that carry temporal information. It is inspired by the two-stage object detection method Faster R-CNN, which first proposes regions and then detects objects within them, and it extends object detection in two-dimensional image space to temporal action detection in three-dimensional video space. First, C3D is used to build a three-dimensional feature extraction sub-network that extracts three-dimensional features from the video. These features are fed into a temporal proposal sub-network to extract temporal proposal segments. Finally, the three-dimensional features and the temporal proposal segments are combined to classify the action in each proposal segment. To improve the performance of the action detection network, the feature extraction sub-network is rebuilt with 3D-ResNet. Compared with the feature extraction sub-network built on C3D, 3D-ResNet34 achieves similar accuracy while reducing the training time per round by 1.17 h and the memory occupied by the model by 324 MB.

(2) An action detection method based on time-domain deconvolution and layer-by-layer spatial convolution. This method improves the temporal proposal sub-network of R-C3D with the aim of improving the temporal localization of actions. For the classification of actions in the spatial domain, a layer-by-layer spatial convolution method is proposed. It adds a convolutional layer with a specific structure to the temporal proposal sub-network of R-C3D to improve the network's ability to classify actions in the video. For the localization of actions in the time domain, the method borrows the deconvolution idea from the CDC network and uses deconvolution to increase the length of the feature map along the time axis. Combining the two, a time-domain deconvolution and layer-by-layer spatial convolution algorithm is proposed that improves the network in the temporal and spatial domains at the same time.

Two comparative experiments are carried out. The first tests whether the CDC network's idea of lengthening the feature map by deconvolution applies to R-C3D. Videos of different durations are used as input, and the longest input duration yields the highest detection accuracy, which supports the claim that lengthening the feature map by deconvolution can improve the accuracy of a temporal action detection network. The second tests whether lengthening the feature map improves the localization accuracy of actions. The proposed algorithm is introduced into the Boundary Sensitive Network (BSN): layer-by-layer spatial convolution and time-domain deconvolution are added to the BSN temporal evaluation module in different combinations as ablation experiments. The experiments verify that the algorithm improves the detection accuracy of the network through more precise action localization.
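As a rough illustration of how a deconvolution (transposed convolution) lengthens a feature map along the time axis, the following minimal Python sketch computes the output length and applies a naive single-channel transposed convolution. The function names, kernel values, and strides are assumptions chosen for illustration; the abstract does not give the layer configuration used in the thesis or in the CDC network.

```python
def deconv_output_length(length_in, kernel_size, stride=1, padding=0, output_padding=0):
    """Output length of a 1-D transposed convolution (dilation fixed at 1)."""
    return (length_in - 1) * stride - 2 * padding + (kernel_size - 1) + output_padding + 1


def temporal_deconv1d(signal, kernel, stride=1):
    """Naive single-channel transposed convolution along time, without padding.

    Each input time step scatters a scaled copy of the kernel into the output,
    so the output is longer than the input whenever stride > 1.
    """
    out = [0.0] * deconv_output_length(len(signal), len(kernel), stride)
    for t, value in enumerate(signal):
        for k, weight in enumerate(kernel):
            out[t * stride + k] += value * weight
    return out


# Hypothetical example: a 4-step temporal feature upsampled by a factor of 2.
features = [1.0, 2.0, 3.0, 4.0]
upsampled = temporal_deconv1d(features, kernel=[1.0, 1.0], stride=2)
print(len(features), "->", len(upsampled))
```

With a stride of 2 and a kernel of size 2, every input time step maps to two output steps, so the temporal resolution doubles. In a real network the kernel weights would be learned and the operation would run over many channels.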
Keywords/Search Tags:video analysis, temporal action detection, three-dimensional convolution, deconvolution, feature map