Action Localization And Recognition Based On Temporal Analysis

Posted on: 2022-03-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F C Long
Full Text: PDF
GTID: 1488306323964339
Subject: Information and Communication Engineering
Abstract/Summary:
With the advent of Web 3.0, advanced technologies such as big data, the mobile Internet, the Internet of Things and parallel computing have led to a surge of artificial intelligence in the research community. Multimedia applications in daily life have become a hotspot of computer science research. Compared to static images, videos carry motion and auditory information, which makes them more complex, and thus the temporal dynamics in videos are unique and critical to video analysis. Research in video analysis has proceeded along several directions, such as video object detection, video captioning, and temporal action recognition and localization. Among these, temporal action recognition and localization are necessary for the development of human-computer interaction. The technology allows machines to understand and recognize human behaviors, which benefits various robotic tasks. However, due to the rich content of action videos, a naive algorithm such as segmenting videos with sliding windows produces many redundant candidates, and the temporal structure within them is not well explored. Meanwhile, acquiring temporal annotations of actions is very expensive, which limits the capacity of localization models. How to leverage limited temporal annotations to enlarge the scalability of action localization models is another urgent problem. To solve these two problems, this thesis starts from an analysis of the temporal structure of videos, and then delves into the hierarchical structure, temporal scale and generalization ability of action localization/recognition models. The thesis proposes coarse-to-fine action proposal networks, Gaussian Temporal Awareness Networks, localization based on domain transfer, and weakly-supervised pre-training of the network backbone. The contributions are summarized as follows.

(1) By exploring the temporal hierarchical granularities of actions, we propose to localize temporal action proposals in a "coarse-to-fine" manner. To materialize this idea, we propose a coarse-to-fine temporal action proposal approach. The approach first models action proposals with three different actionness curves (namely pointwise, pairwise, and recurrent curves) to produce coarse action proposals. Then a 1D convolutional neural network is employed to refine the temporal boundaries in a fine-grained manner. Finally, a proposal re-ranking network is devised to re-rank the proposals produced by the two stages. Compared to a proposal model that works only at the coarse level, our method leads to 2.5% and 4.1% performance gains on average recall and AUC, which demonstrates the effectiveness of the coarse-to-fine manner for temporal action proposal.

(2) To address the problem of predetermined temporal scales in traditional one-shot action localization models, we propose to predict a particular interval for each proposal dynamically with Gaussian Temporal Awareness Networks. By learning a Gaussian kernel for each cell of the feature map, the temporal scale of each temporal action proposal is dynamically optimized. Multiple Gaussian kernels that overlap heavily with each other can even be mixed to capture action proposals of arbitrary length. Moreover, the values in each Gaussian curve reflect the contextual contributions to the localization of an action proposal. Extensive experiments are conducted on both the THUMOS14 and ActivityNet v1.3 datasets, and the proposed approach achieves 1.9% and 1.1% improvements in mAP on the test sets of the two datasets.

(3) To improve the category scalability of action localization models, we introduce a new transfer learning design that learns action localization for a large set of action categories using only action moments from the categories of interest and temporal annotations of untrimmed videos from a small set of action classes. In detail, we bridge temporal action localization and moment recognition through a weight transfer function, and hallucinate the context of the action moments for localization training. In this work, we successfully extend action localization to 600 categories by utilizing moment data in the Kinetics-600 dataset.

(4) Since the network backbone is usually fixed during localization model training, performance largely depends on the generalization ability of the backbone. In this thesis, we introduce a weakly-supervised method for network backbone training that utilizes large-scale web video data. However, web videos raise two issues, i.e., "query ambiguity" (uncertainty of meaning or search intention) and "text isomorphism" (the same syntactic structure shared by different texts). Solely capitalizing on such supervision will mislead video representation learning, so we propose Twin-Turbo Networks (TTN), whose two networks calibrate each other for more accurate supervision. On various datasets of the downstream action recognition task, weakly-supervised pre-training with TTN leads to 2.8%, 1.9% and 2.7% gains in top-1 accuracy on the Kinetics-400 and Something-Something V1 & V2 datasets over the best competitor with fully-supervised ImageNet pre-training.
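
Contribution (2) of the abstract describes learning a Gaussian kernel for each cell of a temporal feature map, where the kernel values weight the surrounding context and the kernel width sets the temporal scale of the proposal. The Python sketch below only illustrates that idea under stated assumptions and is not the thesis implementation: the function names, the plain-list feature layout, and the `width_factor` rule that turns a standard deviation into an interval are all hypothetical.

```python
import math


def gaussian_weights(length, center, sigma):
    """Return normalized Gaussian weights over `length` temporal positions."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    raw = [math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))
           for t in range(length)]
    total = sum(raw)
    return [w / total for w in raw]


def gaussian_pool(features, center, sigma):
    """Weighted average of per-position feature vectors under the kernel.

    `features` is a list of equal-length float lists, one per position.
    """
    weights = gaussian_weights(len(features), center, sigma)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)]


def kernel_to_interval(center, sigma, length, width_factor=1.0):
    """Map a kernel to a (start, end) interval clipped to the sequence.

    Assumption: the interval is center +/- width_factor * sigma. The
    abstract does not state the actual mapping used in the thesis.
    """
    start = max(0.0, center - width_factor * sigma)
    end = min(float(length - 1), center + width_factor * sigma)
    return start, end


# Toy feature map: 10 positions, 1-D features equal to the position index.
toy_features = [[float(t)] for t in range(10)]
pooled = gaussian_pool(toy_features, center=4.5, sigma=1.5)
interval = kernel_to_interval(center=4.5, sigma=1.5, length=10)  # (3.0, 6.0)
```

In this sketch a larger `sigma` spreads the weights over more positions and widens the interval, which mirrors the abstract's point that the temporal scale is set by the learned kernel rather than fixed in advance.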
Keywords/Search Tags: Temporal Action Proposal, Temporal Action Localization, Transfer Learning, Network Pre-training, Action Recognition