
Deep Convolutional Video Representation Learning

Posted on: 2022-10-24    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Z Zhou    Full Text: PDF
GTID: 1488306323465384    Subject: Control Science and Engineering
Abstract/Summary:
Video is a vivid recording and description of objective things, and an intuitive, concrete medium for transmitting and expressing information. With the advent of the Internet era, video has also become one of the most important information carriers for mankind. Video Representation Learning aims to leverage data-driven algorithms to extract semantic vectors from raw videos and to provide representative semantic features for downstream tasks. In recent years, with the rise of deep learning, algorithms based on Deep Convolutional Neural Networks have greatly boosted both the utilization efficiency of visual data and model performance, laying a solid foundation for many real-life applications. However, when processing complex video signals with spatiotemporal characteristics, existing deep model designs and learning algorithms still suffer from serious problems such as low efficiency, high computational cost, and insufficient performance.

To overcome these limitations, on the one hand, this thesis observes the spatiotemporal asymmetry of natural video signals: the amount of information encoded in the spatial domain is significantly greater than that in the temporal domain. It therefore proposes that this asymmetry should be taken into account in model design, distributing the computational modules of a deep convolutional network across the spatial and temporal parts of the video signal on demand and unevenly. This greatly reduces computational complexity and optimization difficulty while improving model performance. Further, this thesis leverages Bayesian Deep Networks to theoretically guarantee the efficacy and generalization of the data-dependent spatiotemporally heterogeneous deep networks, providing a solid theoretical foundation and comprehensive experimental observations for subsequent work in this field.

On the other hand, since a video signal carries far more information than an image, it is more expensive to label manually. From the perspective of self-supervised learning, this thesis therefore extends Variational Inference by considering the inherent stochastic properties and spatiotemporal decoupling characteristics of natural videos, and innovatively proposes a high-order Variational Autoencoder as well as a shadow convolution operation. These enable deep spatiotemporal models to learn more general and representative video representations without manual annotation, and the learned representations reach state-of-the-art performance on multiple downstream tasks.

To comprehensively verify the effectiveness of the proposed schemes, this thesis draws on millions of videos and performs comparison and validation on tasks such as human action recognition, video multi-label annotation, video retrieval, and video prediction. On human action recognition and video retrieval, the spatiotemporally heterogeneous architecture proposed in this thesis achieves the best classification results on several datasets, and the spatiotemporally decoupled self-supervised scheme further improves performance, even becoming comparable to supervised learning. On video multi-label annotation, the proposed adaptive fusion pooling significantly improves recall and accuracy. On video prediction, the proposed high-order Variational Autoencoder successfully predicts multiple futures of natural videos, indicating that it effectively captures the stochastic attributes of natural video, which helps build more representative and complete video representations.

Through comprehensive experimental evaluation and theoretical analysis, the spatiotemporally asymmetric design concept proposed in this thesis has become a consensus in the community; the proposed high-order Variational Autoencoder and spatiotemporally decoupled self-supervised framework have also become a new paradigm for video self-supervised training pipelines, extending the performance boundary and providing new perspectives for the community.
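The asymmetric allocation of computation between the spatial and temporal domains can be illustrated with a parameter-count comparison. The sketch below is hypothetical and is not the thesis's actual module: it contrasts a full 3D convolution with a factorized variant (in the spirit of spatially-heavy (2+1)D-style designs) that spends most of its parameter budget on the spatial kernel and keeps the temporal kernel lightweight; the function names and the channel widths are illustrative assumptions.

```python
# Hypothetical sketch of spatiotemporally asymmetric computation allocation.
# Not the thesis's exact architecture; all names and widths are illustrative.

def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a full 3D convolution (bias omitted)."""
    return c_in * c_out * kt * kh * kw

def asymmetric_factorized_params(c_in, c_out, kt, kh, kw, c_mid):
    """Factorized block: a wide spatial 1 x kh x kw convolution into c_mid
    channels (the bulk of the budget), followed by a light temporal
    kt x 1 x 1 convolution mixing frames."""
    spatial = c_in * c_mid * kh * kw   # spatial part gets most capacity
    temporal = c_mid * c_out * kt      # temporal part stays lightweight
    return spatial + temporal

full = conv3d_params(64, 64, 3, 3, 3)                     # 64*64*27 = 110592
asym = asymmetric_factorized_params(64, 64, 3, 3, 3, 96)  # 55296 + 18432 = 73728
print(full, asym)  # the asymmetric block is cheaper despite a wider spatial path
```

Even with the intermediate width `c_mid` chosen larger than the input width, the factorized block uses fewer parameters than the full 3D convolution, which is one way uneven spatial/temporal allocation can cut cost while preserving spatial capacity.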
Keywords/Search Tags:Video Representation Learning, Video Understanding, Human Action Recognition, Spatiotemporal Fusion, Deep Learning, Self-supervised Learning, Spatiotemporal Heterogeneous Network, Computer Vision