
Research On Temporal Modeling Method For Video Understanding

Posted on: 2022-06-16  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y S Chen  Full Text: PDF
GTID: 1488306734471864  Subject: PhD in Engineering

Abstract/Summary:
With the rapid spread of video capture devices, video has gradually become one of the largest information carriers in society. How to perceive and understand these videos is a key problem in video editing, search, and matching applications, and it has been widely studied in computer vision. In human-centered video understanding, it is necessary not only to represent the video content but also to detect the temporal actions of the people in it. For example, to automatically generate highlight reels of sports games, one must first represent the spatial and temporal information in the game video, then locate the start and end points of each game action and classify it using the video representation, and finally generate the highlight reel by combining the detected clips with domain knowledge of the sport.

In the video representation task, traditional methods based on hand-crafted features have certain advantages in interpretability and background-noise suppression, but they are weak in robustness and processing speed. Deep-learning methods based on 3D convolutional neural networks can learn the spatial and temporal information in a video simultaneously, but their heavy computation and storage requirements limit their use in practical scenarios. Representing video efficiently therefore remains a challenge.

In the temporal action detection task, two-stage methods first localize action instances by generating temporal action proposals and then classify the generated proposals, while one-stage methods regress action instances with an end-to-end model, completing localization and classification in a single pass. These methods can effectively detect simple temporal actions in ideal scenes, but in real-world scenes they still face many challenges, such as large variation in action time span, complex backgrounds, and rich action content.

Given these challenges, this thesis focuses on temporal modeling methods for video understanding, studying both video spatiotemporal feature representation and temporal action detection in depth. The main novelties and contributions are summarized as follows:

(1) To represent video efficiently, this thesis proposes GP3D (Group Pseudo-3D Convolution), an efficient spatiotemporal feature representation method based on grouped pseudo-3D convolution. The method separates convolutions by grouping and decomposes each 3D convolution into a 2D convolution for extracting spatial features and a 1D convolution for extracting temporal features. Based on GP3D, two network frameworks for video representation are constructed: GP3D-MBV3, which combines GP3D with the efficient 2D convolutional network MobileNetV3, and GP3D-INC, which combines GP3D with I3D, a leading network for video classification. Experiments on the UCF101 and HMDB51 benchmarks show that, compared with the state-of-the-art I3D, the method significantly reduces the computation and storage requirements of 3D convolutional neural networks and eases the computational and spatial complexity of training.
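As an illustration, the following is a minimal PyTorch sketch of the grouped pseudo-3D decomposition described above. The class name GroupPseudo3D, the kernel sizes, and the group count are illustrative assumptions, not the thesis's exact design.

```python
import torch
import torch.nn as nn

class GroupPseudo3D(nn.Module):
    """Grouped pseudo-3D block (sketch): the 3D convolution is decomposed
    into a grouped 2D (1x3x3) spatial convolution followed by a grouped
    1D (3x1x1) temporal convolution."""

    def __init__(self, in_channels, out_channels, groups=4):
        super().__init__()
        # 2D convolution applied frame by frame to extract spatial features.
        self.spatial = nn.Conv3d(in_channels, out_channels, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), groups=groups, bias=False)
        # 1D convolution across frames to extract temporal features.
        self.temporal = nn.Conv3d(out_channels, out_channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), groups=groups, bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):            # x: (batch, channels, frames, height, width)
        x = self.spatial(x)          # spatial features per frame
        x = self.temporal(x)         # temporal features across frames
        return self.relu(self.bn(x))

# Example: a feature tensor for a 16-frame clip.
x = torch.randn(2, 64, 16, 28, 28)
block = GroupPseudo3D(64, 64, groups=4)
print(block(x).shape)  # torch.Size([2, 64, 16, 28, 28])
```

Because the grouped 2D and 1D kernels together touch far fewer weights than a full grouped 3x3x3 kernel, this factorization is what reduces the computation and storage cost noted above.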
(2) To address the large variation in action time span and the difficulty of exploiting the context of temporal proposals, this thesis proposes B-GCN (Boundary Graph Convolutional Network), a temporal action detection method based on a boundary graph convolutional network. The method fuses video spatiotemporal features through a base network and generates temporal-proposal features through a sampling module. A graph convolutional network is then built on the proposal features to learn information from neighboring proposals: each proposal feature serves as a graph node, and edges are constructed dynamically according to the distance between proposal features (see the first code sketch below). The method contains two graph convolutional branches: one generates an action classification score map and an action regression score map; the other generates boundary start and boundary end score maps, which are fused to produce dense temporal action proposals. Combined with a video-level action classification network, this realizes temporal action detection. Experiments on the THUMOS14 and ActivityNet-1.3 benchmarks show that, compared with state-of-the-art methods such as BMN, B-GCN not only has an advantage in detection performance but also has fewer model parameters and higher processing efficiency.

(3) To reduce the impact of complex backgrounds and rich action content on temporal action detection, this thesis proposes CapsBoundNet (Capsule Boundary Network), a temporal action detection method based on a 3D convolutional dynamic-routing capsule boundary network. The method uses a capsule network to avoid the invariance limitation introduced by the pooling operations of convolutional neural networks, and thus better captures the temporal relationships needed for detection. To reduce the computation and complexity of the capsule network, a 3D convolutional dynamic routing based on a k-nearest-neighbor mechanism is constructed (see the second code sketch below). In addition, to let the capsule network learn multi-scale features, a U-shaped capsule network framework based on 3D convolutional capsules is proposed. Experiments on the THUMOS14 and ActivityNet-1.3 benchmarks show that B-GCN retains an advantage in processing efficiency, while CapsBoundNet greatly improves detection accuracy; compared with other state-of-the-art methods such as BMN and P-GCN, CapsBoundNet also shows clear advantages in detection performance.
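First, a minimal sketch of B-GCN's dynamic graph construction over proposal features, assuming distance is measured between proposal feature vectors and each node keeps its k nearest neighbors; the function names and the value of k are hypothetical, not the thesis's exact formulation.

```python
import torch
import torch.nn as nn

def build_proposal_graph(feats, k=4):
    """Dynamically connect temporal-proposal features: each proposal
    (node) is linked to its k nearest neighbors in feature space."""
    dist = torch.cdist(feats, feats)                 # (N, N) pairwise distances
    knn = dist.topk(k + 1, largest=False).indices    # self plus k neighbors
    adj = torch.zeros_like(dist)
    adj.scatter_(1, knn, 1.0)                        # adjacency with self-loops
    deg = adj.sum(dim=1, keepdim=True)
    return adj / deg                                 # row-normalized adjacency

class GCNLayer(nn.Module):
    """One graph-convolution layer: aggregate neighbor features through
    the adjacency matrix, then apply a shared linear transform."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, feats, adj):
        return torch.relu(self.linear(adj @ feats))

# Example: 100 proposal features of dimension 256.
proposals = torch.randn(100, 256)
adj = build_proposal_graph(proposals, k=4)
out = GCNLayer(256, 128)(proposals, adj)
print(out.shape)  # torch.Size([100, 128])
```

In the full method, separate branches of such layers would produce the classification/regression score maps and the boundary start/end score maps described above.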
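Second, a heavily simplified sketch of dynamic routing restricted by a k-nearest-neighbor mechanism. It uses fully connected capsule routing rather than the thesis's 3D convolutional routing, and the k-NN restriction is rendered here as masking all but the k best-agreeing routing logits; both are illustrative readings, not CapsBoundNet's actual implementation.

```python
import torch

def squash(v, dim=-1):
    """Capsule non-linearity: short vectors shrink toward zero,
    long vectors approach unit length."""
    n2 = (v * v).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * v / torch.sqrt(n2 + 1e-8)

def knn_dynamic_routing(u_hat, k=3, iters=3):
    """Routing-by-agreement where, after the first pass, each lower
    capsule keeps routing weight only for the k upper capsules it
    agrees with most (the k-nearest-neighbor restriction).
    u_hat: (n_lower, n_upper, dim) prediction vectors."""
    n_lower, n_upper, _ = u_hat.shape
    b = torch.zeros(n_lower, n_upper)             # routing logits
    for it in range(iters):
        logits = b
        if it > 0:                                # mask once agreements exist
            keep = b.topk(k, dim=1).indices
            mask = torch.full_like(b, float("-inf"))
            mask.scatter_(1, keep, 0.0)
            logits = b + mask
        c = torch.softmax(logits, dim=1)          # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(0)      # weighted sum per upper capsule
        v = squash(s)                             # (n_upper, dim) capsule outputs
        b = b + (u_hat * v.unsqueeze(0)).sum(-1)  # agreement update
    return v

# Example: 32 lower capsules routing to 8 upper capsules of dimension 16.
out = knn_dynamic_routing(torch.randn(32, 8, 16), k=3)
print(out.shape)  # torch.Size([8, 16])
```

Restricting each lower capsule to k routing targets is what keeps the cost of the agreement iterations manageable, which matches the efficiency motivation stated above.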
(4) Finally, this thesis explores the application of temporal modeling methods for video understanding to intelligent sports-highlight editing, using temporal representation and temporal action detection to automatically generate highlight reels of basketball games. In this application, GP3D-INC extracts features from each “shot” of video, and the B-GCN model then performs temporal action detection on the extracted features, localizing game actions accurately enough to clip the highlight moments. In addition, to match a “shot” in the live footage with the corresponding “shot” in the replay footage, a three-dimensional relation network module replaces traditional distance measures for video matching, with good results. Finally, combined with domain knowledge, highlight reels of basketball games are generated automatically. This technology saves the labor cost of manual video editing and realizes automatic intelligent editing; it has been deployed in the production systems of CCTV Sports and other TV stations with good engineering results.
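A minimal sketch of relation-network-style matching for the live/replay pairing above, assuming each clip is already pooled into a feature vector and the pair is scored by a small learned MLP instead of a fixed distance; the class name, feature dimension, and layer sizes are hypothetical, and the thesis's module operates on 3D feature volumes rather than pooled vectors.

```python
import torch
import torch.nn as nn

class RelationMatcher(nn.Module):
    """Learned similarity for video matching (sketch): instead of a fixed
    distance such as cosine or Euclidean, a small MLP scores how related
    two clip feature vectors are."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.relation = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),                  # relation score in [0, 1]
        )

    def forward(self, live_feat, replay_feat):
        # Concatenate the pair and let the network judge the relation.
        pair = torch.cat([live_feat, replay_feat], dim=-1)
        return self.relation(pair).squeeze(-1)

# Example: match one live "shot" feature against 5 replay candidates.
live = torch.randn(1, 1024).expand(5, 1024)
replays = torch.randn(5, 1024)
scores = RelationMatcher()(live, replays)
print(scores.argmax().item())  # index of the best-matching replay clip
```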
Keywords/Search Tags: video understanding, spatiotemporal representation, temporal action detection, temporal action proposal generation, deep learning