
Video Spatio-Temporal Representation Learning Methods

Posted on: 2021-05-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z F Qiu
Full Text: PDF
GTID: 1368330602994194
Subject: Control Science and Engineering

Abstract/Summary:
Today's digital contents are inherently multimedia: text, audio, image, video and so on. Images and videos, in particular, have become a new way of communication between Internet users with the proliferation of sensor-rich mobile devices. This has encouraged the development of advanced techniques for a broad range of multimedia understanding applications. Among them, a fundamental breakthrough underlying the success of these techniques is representation learning. This can be evidenced by the development of Convolutional Neural Networks (CNN), which demonstrate a high capability of learning and generalization in visual representation. For example, the image representation learned by residual networks has successfully pushed the limits of image understanding tasks with remarkable improvements in state-of-the-art performance. However, learning powerful and generic spatio-temporal representations remains challenging, due to the larger variations and complexities of video content. Most existing approaches for video applications rely heavily on image representation and simply apply a 2D CNN to each frame individually, so the temporal evolution across consecutive frames is not fully exploited. It is therefore crucial to emphasize temporal dynamics for video content understanding and to build a unified video spatio-temporal representation learning framework.

To achieve this goal, this thesis starts from neural network architectures designed specifically for video (e.g., 3D CNN), and studies how to devise and integrate novel structures, including pseudo-3D blocks, convolutional activation encoding, local-and-global diffusion blocks and a network architecture auto-design framework, to equip the network with a strong ability to learn powerful spatio-temporal representations. In summary, this thesis makes the following contributions:

(1) This thesis devises multiple variants of bottleneck building blocks in a residual learning framework by simulating 3D convolutions with 2D convolutional filters on
the spatial domain plus 1D convolutions that construct temporal connections across adjacent feature maps in time. Furthermore, this thesis proposes a new architecture, named Pseudo-3D Residual Net (P3D ResNet), that exploits all the variants of blocks but composes each at a different placement in the ResNet, following the philosophy that enhancing structural diversity while going deep can improve the power of neural networks. Our P3D ResNet achieves clear improvements on the Sports-1M video classification dataset over 3D CNN and frame-based 2D CNN by 5.3% and 1.8%, respectively. This thesis further examines the generalization performance of the video representation produced by our pre-trained P3D ResNet on five different benchmarks and three different tasks, demonstrating superior performance over several state-of-the-art techniques.

(2) This thesis presents a novel framework to boost spatio-temporal representation learning via Local and Global Diffusion (LGD). Specifically, this thesis constructs a novel neural network architecture that learns local and global representations in parallel. The architecture is composed of LGD blocks, where each block updates local and global features by modeling the diffusions between these two representations. Furthermore, a kernelized classifier is introduced to combine the representations from the two aspects for video recognition. Our LGD networks achieve clear improvements on the large-scale Kinetics-400 and Kinetics-600 video classification datasets over the best competitors by 3.5% and 0.7%, respectively. This thesis further examines the generalization of both the global and local representations produced by our pre-trained LGD networks on four different benchmarks for video action recognition and spatio-temporal action detection tasks. Superior performance over several state-of-the-art techniques on these benchmarks is reported.

(3) This thesis introduces a new idea for automatically exploring architectures built on a remould of Differentiable Architecture Search (DAS), which
enables efficient search via gradient descent. Specifically, this thesis presents Scheduled Differentiable Architecture Search (SDAS) for spatio-temporal representation learning, which nicely integrates the selection of operations during training with a schedule. Moreover, this thesis enlarges the search space of SDAS for video data by devising several unique operations to encode spatio-temporal dynamics, and demonstrates their impact on the architecture search of SDAS. Extensive architecture learning experiments are conducted on the Kinetics-10, UCF101 and HMDB51 datasets, and superior results are reported compared to the DAS method. More remarkably, the search by our SDAS is around 2-fold faster than DAS. When transferring the learnt cells on Kinetics-10 to the large-scale Kinetics-400 dataset, the constructed network also outperforms several state-of-the-art hand-crafted structures.
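The 2D-plus-1D factorization behind contribution (1) can be illustrated by counting parameters: a pseudo-3D block replaces one 3x3x3 kernel with a 1x3x3 spatial kernel followed by a 3x1x1 temporal kernel. The following is a minimal sketch; the channel counts and kernel sizes are illustrative only and are not the thesis's exact P3D ResNet configuration:

```python
def conv3d_params(c_in, c_out, t, h, w):
    """Weight count of a full 3D convolution (bias omitted)."""
    return c_in * c_out * t * h * w

def p3d_params(c_in, c_out, t, h, w):
    """Weight count of the decomposition: a 1xHxW spatial convolution
    followed by a Tx1x1 temporal convolution on its output channels."""
    spatial = c_in * c_out * 1 * h * w
    temporal = c_out * c_out * t * 1 * 1
    return spatial + temporal

# For 64 input/output channels and a 3x3x3 kernel:
full = conv3d_params(64, 64, 3, 3, 3)   # 110592 weights
decomposed = p3d_params(64, 64, 3, 3, 3)  # 36864 + 12288 = 49152 weights
```

Under these illustrative settings the decomposition uses less than half the weights of the full 3D convolution, which is one reason such blocks are cheaper to train and stack deeper.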
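The local-global diffusion in contribution (2) can be sketched at the level of a single block. In this hypothetical plain-Python sketch, the local pathway is a list of scalars standing in for spatial feature vectors, the global pathway is one scalar, and the fixed weights `w_lg`/`w_gl` stand in for the learned linear mappings of the actual LGD block:

```python
def avg(xs):
    """Stand-in for global average pooling over spatial positions."""
    return sum(xs) / len(xs)

def lgd_block(local, global_feat, w_lg=0.5, w_gl=0.5):
    """One diffusion step between the two pathways:
    each local feature absorbs the broadcast global descriptor,
    then the global descriptor absorbs the pooled local features."""
    new_local = [x + w_lg * global_feat for x in local]
    new_global = global_feat + w_gl * avg(new_local)
    return new_local, new_global

# One block update on toy features:
loc, glob = lgd_block([1.0, 3.0], 2.0)  # -> [2.0, 4.0], 3.5
```

Stacking such blocks lets local evidence and the global summary refine each other layer by layer, which is the intuition the LGD networks build on.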
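Contribution (3) builds on differentiable search, where each edge of the network computes a softmax-weighted mixture of candidate operations and SDAS additionally schedules when a concrete operation is committed. A minimal sketch follows; the hard-selection schedule here is a simplified stand-in for the one in the thesis, not its exact rule:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def mixed_op(x, ops, alphas):
    """DAS-style mixed operation: the edge output is the
    softmax(alpha)-weighted sum of all candidate operations."""
    ws = softmax(alphas)
    return sum(w * op(x) for w, op in zip(ws, ops))

def scheduled_select(alphas, epoch, select_epoch):
    """Simplified SDAS-style schedule: keep the soft mixture early in
    training, then commit to the strongest candidate operation."""
    if epoch < select_epoch:
        return None  # still searching with soft weights
    return max(range(len(alphas)), key=lambda i: alphas[i])

ops = [lambda x: x, lambda x: 2 * x]       # toy candidate operations
y = mixed_op(2.0, ops, [0.0, 0.0])         # equal weights -> 3.0
chosen = scheduled_select([0.2, 1.5], 10, 5)  # past the schedule -> index 1
```

Committing operations on a schedule rather than only at the end is what lets the search prune the mixture progressively, which is consistent with the roughly 2-fold speed-up over DAS reported above.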
Keywords/Search Tags:Convolutional Neural Networks, Video Representation Learning, Video Classification, Action Recognition, Network Architecture Search