
Research On Video-based Temporal Action Localization And Recognition

Posted on: 2022-10-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R H Zeng
GTID: 1488306569459034
Subject: Software engineering

Abstract/Summary:
With the development of cloud computing and mobile internet technology, video has become the main carrier of information and is growing explosively. Faced with such a large number of videos, how to automatically, accurately, and efficiently understand human actions in them is a hot topic in current artificial intelligence research. This thesis studies video-based temporal action localization (TAL), which aims to automatically localize the start and end times of actions and identify their categories. TAL is crucial for many downstream video analysis tasks, such as surveillance analysis, video retrieval, and video question answering. However, videos are large in volume, high in dimensionality, and complex in content, so TAL faces the following challenges: 1) The temporal information in videos is complex and temporal features are hard to extract, making it difficult to localize the temporal positions of actions. 2) Due to scene changes and noise interference, the spatiotemporal information in videos is hard to model, which increases the difficulty of extracting features for recognizing actions. 3) The annotations of videos in TAL are often sparse and carry little information, which makes training neural network models very difficult. 4) The video annotation process is time-consuming and laborious, and the annotation results are often subjective. To address these issues, this thesis focuses on efficient feature extraction and efficient training methods for TAL and proposes a series of new methods. The innovations and contributions are summarized as follows:

1) Aiming at the difficulty of temporal feature extraction, a TAL algorithm based on graph convolutional networks is proposed. Motivated by the temporal relations between video contents, this thesis proposes to exploit the correlations between action proposals for efficient temporal feature extraction. To model the relations between proposals, a graph construction method is designed. A temporal feature extraction algorithm relying on graph convolution is then proposed, which uses graph convolutional networks to aggregate interactive information between proposals and exploits context information to enhance the temporal features of proposals (a minimal code sketch of this idea is given after contribution 2 below). Experimental results show that the proposed method effectively models the correlations between proposals and significantly improves the accuracy of TAL.

2) To extract action features under noise and background interference, this thesis proposes a temporal action localization algorithm based on audio and video content. Specifically, a feature extraction module based on an audio attention mechanism is proposed, which uses the audio signal contained in the video to guide the model to focus on the spatial regions related to actions. To exploit the information shared between the two modalities (i.e., audio and vision), this thesis proposes a cross-modal relational attention module that captures the correlation between audio and vision to enhance action features (see the second sketch below). Experimental results on the audio-visual event localization task show that the proposed method effectively extracts action features despite noise and background interference, and ultimately improves the accuracy of TAL.
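To make contribution 1 concrete, the following is a minimal sketch of the general technique: build a graph over action proposals from their temporal overlap, then aggregate context with one graph-convolution layer. It illustrates the idea rather than the thesis's implementation; the IoU threshold, feature dimensions, and all names here are assumptions.

    # Hypothetical sketch: proposal graph from temporal IoU + one GCN layer.
    import torch

    def temporal_iou(props: torch.Tensor) -> torch.Tensor:
        """props: (N, 2) tensor of (start, end) times; returns (N, N) IoU matrix."""
        start, end = props[:, 0], props[:, 1]
        inter_start = torch.max(start[:, None], start[None, :])
        inter_end = torch.min(end[:, None], end[None, :])
        inter = (inter_end - inter_start).clamp(min=0)
        union = (end - start)[:, None] + (end - start)[None, :] - inter
        return inter / union.clamp(min=1e-6)

    class ProposalGCNLayer(torch.nn.Module):
        """One graph-convolution layer over proposal features."""
        def __init__(self, dim: int):
            super().__init__()
            self.linear = torch.nn.Linear(dim, dim)

        def forward(self, feats: torch.Tensor, props: torch.Tensor) -> torch.Tensor:
            # Connect proposals whose temporal IoU exceeds a threshold (assumed 0.3).
            adj = (temporal_iou(props) > 0.3).float()
            adj = adj / adj.sum(dim=1, keepdim=True)  # row-normalize
            # Aggregate context from connected proposals, then transform.
            return torch.relu(self.linear(adj @ feats))

    feats = torch.randn(8, 256)                  # 8 proposals, 256-d features
    props = torch.rand(8, 2).sort(dim=1).values  # random (start, end) pairs
    enhanced = ProposalGCNLayer(256)(feats, props)  # context-enhanced features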
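For the cross-modal relational attention in contribution 2, here is a similar hedged sketch in which an audio feature queries visual region features through attention; the single-head design, dimensions, and module names are illustrative assumptions, not the thesis's architecture.

    # Hypothetical sketch: an audio query attends over visual spatial regions.
    import torch

    class CrossModalAttention(torch.nn.Module):
        def __init__(self, audio_dim: int, visual_dim: int, hidden: int):
            super().__init__()
            self.q = torch.nn.Linear(audio_dim, hidden)   # audio -> query
            self.k = torch.nn.Linear(visual_dim, hidden)  # regions -> keys
            self.v = torch.nn.Linear(visual_dim, hidden)  # regions -> values

        def forward(self, audio: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
            # audio: (B, audio_dim); regions: (B, R, visual_dim) spatial features
            q = self.q(audio)[:, None, :]            # (B, 1, H)
            k, v = self.k(regions), self.v(regions)  # (B, R, H)
            attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
            return (attn @ v).squeeze(1)             # (B, H) attended visual feature

    audio = torch.randn(4, 128)        # batch of audio features
    regions = torch.randn(4, 49, 512)  # e.g., a 7x7 grid of visual region features
    fused = CrossModalAttention(128, 512, 256)(audio, regions)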
3) To tackle the problem of sparse annotations in videos, this thesis proposes a TAL algorithm based on a dense regression mechanism. Specifically, a temporal boundary regression module built on dense regression is proposed. This module significantly increases the number of positive training samples without changing the annotated information, providing a new perspective on how to mine effective supervision from sparsely annotated data (a sketch is given after contribution 4 below). This thesis also proposes a regression module based on Intersection over Union (IoU), which further refines the localization results. Experimental results show that, compared with algorithms that train directly on sparse annotations, the proposed method effectively mines more supervision information and significantly improves training efficiency and inference accuracy.

4) In view of the time-consuming and laborious process of video annotation, this thesis proposes a weakly supervised temporal action localization algorithm that trains neural networks without temporal annotations. Specifically, this thesis proposes a method based on an anti-erasing mechanism that iteratively erases the most discriminative segments during training, so that the model learns to accurately locate segments that are less discriminative (see the final sketch below). Next, a class-specific importance weight calculation method is proposed, which achieves precise localization for each type of action. Experimental results show that the proposed method is able to localize actions without frame-level temporal annotations and achieves performance comparable to methods that use such annotations.
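The dense regression mechanism of contribution 3 can be sketched as follows: every frame inside an annotated action becomes a positive sample whose targets are its distances to the start and end boundaries, so a single annotation yields many training samples. The frame-level granularity and function names are assumptions for illustration.

    # Hypothetical sketch: dense boundary-regression targets from one annotation.
    import torch

    def dense_regression_targets(num_frames: int, start: int, end: int):
        """Every frame inside [start, end) becomes a positive sample whose
        targets are its distances to the action's start and end boundaries."""
        t = torch.arange(num_frames, dtype=torch.float32)
        inside = (t >= start) & (t < end)  # positive-sample mask
        d_start = t - start                # distance back to the start
        d_end = end - t                    # distance forward to the end
        targets = torch.stack([d_start, d_end], dim=1)
        return inside, targets

    # One annotation (start=30, end=90) yields 60 positive samples instead of 1.
    mask, targets = dense_regression_targets(num_frames=120, start=30, end=90)
    print(mask.sum().item())  # -> 60 positive training frames
    print(targets[mask][:3])  # per-frame (dist_to_start, dist_to_end) targets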
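Finally, for the anti-erasing mechanism of contribution 4, a minimal sketch of one erasing step: the highest-scoring snippets are masked out so that a later pass must localize less discriminative evidence. The top-k selection rule, iteration count, and the random re-scoring stand-in are assumptions, not the thesis's procedure.

    # Hypothetical sketch: erase the most discriminative temporal segments.
    import torch

    def erase_top_segments(feats: torch.Tensor, scores: torch.Tensor, k: int):
        """feats: (T, D) snippet features; scores: (T,) class activation scores.
        Zero out the k highest-scoring snippets and return the erased features."""
        top = scores.topk(k).indices  # indices of the most discriminative snippets
        erased = feats.clone()
        erased[top] = 0.0             # erase them from the input
        return erased

    T, D = 100, 256
    feats = torch.randn(T, D)
    scores = torch.randn(T)           # e.g., a class activation sequence
    # Iteratively erase, forcing later passes to rely on weaker evidence.
    for _ in range(3):
        feats = erase_top_segments(feats, scores, k=5)
        scores = torch.randn(T)       # stand-in for re-scoring the erased input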
Keywords/Search Tags: Video Analysis, Temporal Action Localization, Action Recognition, Graph Convolutional Networks, Weakly Supervised Learning