
Research on Deep Temporal Feature Learning Algorithm Based on Self-supervised Learning

Posted on: 2022-10-08    Degree: Master    Type: Thesis
Country: China    Candidate: J L Kang    Full Text: PDF
GTID: 2518306527455354    Subject: Master of Engineering
Abstract/Summary:
Videos provide richer visual information than still images, and the spatio-temporal features extracted from them can be applied to many visual tasks, such as video retrieval and action recognition. In existing training strategies, videos are fed into networks in random order to learn spatio-temporal features. However, we observe that videos differ in their level of frame/clip sequence saliency: the correct frame/clip order is easier to identify for videos with high sequence saliency than for those with low sequence saliency. We therefore believe that making effective use of frame/clip sequence saliency benefits spatio-temporal feature learning and improves the performance of related visual models. The main contents and innovations of this thesis are:

1. We propose a new concept, video sequence saliency (VSS), which measures how difficult it is for a visual model to identify the correct frame/clip order of a video. Based on it, we develop progressive self-supervised spatio-temporal feature learning based on VSS (PSSFL-VSS). The algorithm has two stages: model pre-training and model transfer. For the clip order prediction pretext task, the pre-training strategy feeds videos into the network in descending order of their VSS values, and 3D CNNs (C3D, R3D and R(2+1)D) are used to learn spatio-temporal features. First, the VSS value of each video is updated from the clip order prediction results; the videos are then ranked by the updated VSS values, and a hyper-parameter divides the ranked videos into several groups, which are fed into the network in descending order of VSS rather than randomly as in traditional methods. The VSS values and the video ranking are updated at every iteration until the model converges (a sketch of this training loop follows the abstract). Experiments show that the proposed algorithm improves accuracy over the baseline by 2.9%, with clear gains in clip retrieval, video retrieval and action recognition, which verifies the effectiveness and superiority of the proposed models.

2. To address the difficulty of learning temporal features effectively in the video generation task, this thesis improves the Self-supervised Spatio-temporal Feature Learning Video Generative Adversarial Network (SSFLVGAN). In the generator network G, an L2-regularized loss function is used to mitigate overfitting. In the discriminator network D, a 3D average pooling layer is added after the first four convolution layers to reduce the number of model parameters, so that D can distinguish synthetic videos from real ones and judge whether the temporal relationship of motion between frames is correct (a sketch of this discriminator follows the abstract). Video generation experiments on the related datasets show that, compared with the baseline, the evaluation metrics are effectively improved and the videos generated by SSFLVGAN are more realistic.
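A minimal PyTorch sketch of the VSS-ordered progressive training loop described in item 1, under stated assumptions: the dataset interface (shuffled_clips, video id), the use of clip-order-prediction accuracy as a per-video VSS proxy, and the number of groups are illustrative choices, not details specified in the thesis.

import torch

def clip_order_accuracy(model, video, device="cuda"):
    # Proxy for a video's sequence saliency (VSS): how reliably the model
    # recovers the correct clip order for this video. `shuffled_clips` is a
    # hypothetical dataset method returning shuffled clips and the order label.
    clips, order_label = video.shuffled_clips()
    with torch.no_grad():
        logits = model(clips.to(device))          # predicts a permutation class
    return (logits.argmax(dim=-1) == order_label.to(device)).float().mean().item()

def train_progressive(model, videos, optimizer, criterion,
                      num_groups=4, max_iters=100, device="cuda"):
    vss = {v.id: 0.0 for v in videos}             # VSS value per video
    for _ in range(max_iters):
        # 1) update each video's VSS from the clip order prediction results
        for v in videos:
            vss[v.id] = clip_order_accuracy(model, v, device)
        # 2) rank videos by VSS (descending) and split into groups
        ranked = sorted(videos, key=lambda v: vss[v.id], reverse=True)
        size = max(1, len(ranked) // num_groups)
        groups = [ranked[i:i + size] for i in range(0, len(ranked), size)]
        # 3) train on high-VSS (easier) groups first, then progressively harder ones
        for group in groups:
            for v in group:
                clips, order_label = v.shuffled_clips()
                logits = model(clips.to(device))
                loss = criterion(logits, order_label.to(device))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return model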
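A minimal PyTorch sketch of the discriminator modification described in item 2, interpreting "a 3D average pooling layer after the first four convolution layers" as one pooling layer after each of the first four 3D convolutions; the channel widths, kernel sizes, and classifier head are illustrative assumptions rather than the thesis's exact configuration.

import torch
import torch.nn as nn

class VideoDiscriminator(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        widths = [64, 128, 256, 512]
        layers, c_in = [], in_channels
        for c_out in widths:                      # first four 3D conv blocks
            layers += [
                nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
                nn.AvgPool3d(kernel_size=2),      # added 3D average pooling
            ]
            c_in = c_out
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(c_in, 1),                   # real vs. synthetic score
        )

    def forward(self, x):                         # x: (batch, C, T, H, W)
        return self.classifier(self.features(x))

# Example: a batch of 2 clips, 3 channels, 16 frames, 64x64 resolution.
# scores = VideoDiscriminator()(torch.randn(2, 3, 16, 64, 64))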
Keywords/Search Tags: Self-supervised Learning, Spatio-temporal Feature Learning, Clip/Video Retrieval, Action Recognition, Video Generation