Deep Learning Based Temporal Action Localization

Posted on: 2022-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: X P Ding
Full Text: PDF
GTID: 2518306602489854
Subject: Signal and Information Processing
Abstract/Summary:
Temporal action localization (TAL) aims to identify the categories of actions in long untrimmed videos and to locate the start and end time of each action. Because of its wide range of applications in public safety and digital entertainment, TAL has received increasing attention in recent years. For example, in public safety, TAL can be used for real-time monitoring, reducing labor and cost. In digital entertainment, it can be used for highlight detection in sports videos. TAL faces four main challenges. (1) Videos contain many noisy frames, which makes actions difficult to localize. (2) Action durations vary widely (e.g., from 0.1 s to 30 s), so generating flexible temporal boundaries is essential. (3) Although existing fully supervised methods achieve good performance, obtaining precise temporal boundaries is time-consuming and labor-intensive. To reduce the annotation cost, many works adopt a weakly supervised setting that uses only video-level action category annotations. However, their performance falls far short of that of fully supervised methods. (4) The categories in TAL are restricted to a pre-defined list of actions, which is not flexible enough for practical applications. To address these problems, the main content and contributions of this thesis are summarized as follows:

1. A frame centroid radiation network for temporal action localization is proposed to model the relations between frames explicitly. Existing methods ignore the intra and inter relations of actions, i.e., features of actions in the same category are similar, while features of actions in different categories are dissimilar. In this thesis, we introduce a centroid radiation network that explicitly models the relation between each pair of frames. By using the intra and inter relations of action instances, the centroid radiation network can generate flexible temporal boundaries between background and action frames. First, a relation network models the feature similarity of pair-wise frames. Then a centroid network finds the centroid of each action instance and classifies its category. Under the rationale that a centroid and its neighboring frames with high affinity are likely to belong to the same action class, we finally adopt a random walk algorithm that generates instance boundaries by exploiting the affinity between each pair of frames and centroids. Experimental results show the effectiveness of our method.

2. A frame-level weakly supervised method for temporal action localization is proposed to greatly reduce the cost of annotation. Although video-level category annotation costs much less than annotating precise temporal boundaries (as fully supervised methods require), the localization performance of video-level methods falls far short of that of fully supervised ones. In this thesis, we introduce a new frame-level weakly supervised setting in which only one or two randomly chosen frames in each action are labeled. Because they do not require precise start and end times, frame-level annotations greatly reduce the annotation cost while providing more localization information than video-level labels. We devise a localization module that uses the frame-level labels with a partial frame loss, a sphere loss and a propagation loss to improve localization performance. Experiments validate that our method outperforms video-level supervision methods with almost the same annotation time.

3. A K-farthest crossover based semi-supervised temporal action localization method is proposed to achieve competitive performance with only a few labeled samples. Semi-supervised approaches employing consistency regularization, which trains a model to be robust to perturbed inputs, have achieved great success in image classification. The success of consistency regularization depends on the perturbations: perturbations that are too small are not enough to train a robust model, while perturbations that are too large alter the semantics of the original features. However, the perturbations used in image or video classification, such as flipping or cropping, are not suitable for temporal action localization. Because the videos in temporal action localization are very long, it is infeasible to train an end-to-end model on raw videos. In this thesis, we devise a method named K-farthest crossover that constructs perturbations from video features. Motivated by the observation that features within the same action instance become increasingly similar during training, while those in different action instances or in the background become increasingly divergent, we add perturbations to each feature along the temporal axis and adopt consistency regularization to encourage the model to preserve this property. Experiments indicate that our method improves the performance of semi-supervised temporal action localization and can even match that of fully supervised methods.

4. A cross-modal capsule pyramid network for temporal language localization is proposed to address the inconsistency between videos and text. Temporal language localization, also called language-based temporal action localization, aims to localize the moment corresponding to a given natural language query. Compared with temporal action localization, which requires a pre-defined action list, temporal language localization is more flexible. The key to this task is how to extract the relations between text and video. Existing methods fuse textual and visual features only in a local or a global manner, which leads to inconsistent correspondence. In this thesis, we propose a cross-modal capsule pyramid network that captures text-video correspondence at multiple scale levels. First, we extract a set of visual features and a set of textual features and fuse them from local to global by capsule routing. Finally, we capture the cross-modal correspondence from local to global via a feature pyramid network. Experiments validate the effectiveness of our proposed method.
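
The boundary-generation idea in contribution 1 (a centroid spreads to neighboring frames with high affinity through a random walk) can be illustrated with a minimal sketch. This is not the thesis's implementation. The function name random_walk_segments, the cosine-based affinity, the restart probability, the temperature and the background threshold are all assumptions chosen for illustration. The thesis learns affinities with a relation network and finds centroids with a centroid network, whereas here both are given as plain inputs.

```python
import math

def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def random_walk_segments(features, centroids, restart=0.5,
                         temperature=0.1, threshold=0.05, steps=60):
    """Hypothetical sketch: random walk with restart from each centroid frame.

    features  : list of per-frame feature vectors
    centroids : list of frame indices, one per action instance (assumed given)
    Returns (labels, segments): labels[j] is the centroid index assigned to
    frame j, or -1 for background; segments[k] is the inclusive
    (start, end) frame range grown around centroid k.
    """
    n = len(features)
    # Pairwise affinity (assumed form): 1 for identical directions,
    # decaying exponentially as cosine similarity drops.
    affinity = [[math.exp((cosine(features[i], features[j]) - 1.0) / temperature)
                 for j in range(n)] for i in range(n)]
    # Row-normalize the affinities into random-walk transition probabilities.
    transition = [[a / sum(row) for a in row] for row in affinity]

    scores = []
    for c in centroids:
        p = [0.0] * n
        p[c] = 1.0
        for _ in range(steps):
            # One walk step, then jump back to the centroid with prob. `restart`.
            p = [(1.0 - restart) * sum(p[i] * transition[i][j] for i in range(n))
                 + (restart if j == c else 0.0) for j in range(n)]
        peak = max(p)
        scores.append([x / peak for x in p])  # scale so the best frame scores 1

    # Each frame takes its highest-scoring centroid, or background if too low.
    labels = []
    for j in range(n):
        k = max(range(len(centroids)), key=lambda idx: scores[idx][j])
        labels.append(k if scores[k][j] >= threshold else -1)

    # Grow a contiguous segment outward from every centroid.
    segments = []
    for k, c in enumerate(centroids):
        start = end = c
        while start > 0 and labels[start - 1] == k:
            start -= 1
        while end < n - 1 and labels[end + 1] == k:
            end += 1
        segments.append((start, end))
    return labels, segments

# Toy video: 12 frames with one-hot features. Frames 0-2 and 7-8 are
# background, frames 3-6 are one action and frames 9-11 are another.
features = [[0.0, 0.0, 0.0] for _ in range(12)]
for j in (0, 1, 2, 7, 8):
    features[j][0] = 1.0
for j in range(3, 7):
    features[j][1] = 1.0
for j in range(9, 12):
    features[j][2] = 1.0

labels, segments = random_walk_segments(features, centroids=[4, 10])
print(labels)
print(segments)
```

In this toy setting the walk started at each centroid stays almost entirely within frames that share the centroid's feature direction, so the segment grown around the centroid stops where the affinity drops. With learned, noisier affinities the restart probability and threshold would need tuning.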
Keywords/Search Tags:Temporal action localization, semi-supervised learning, weakly supervised learning, multi-modality, centroid radiation network, capsule network