
Video Saliency Detection Based On Improved Attention Network And Data Augmentation

Posted on: 2022-10-18    Degree: Master    Type: Thesis
Country: China    Candidate: X Wang    Full Text: PDF
GTID: 2518306485466324    Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, the amount of information that people acquire is growing explosively. If we can simulate the human Visual Attention Mechanism and pre-process this vast amount of information to pick out its most important parts, a great deal of computing resources can be saved. Salient objects are the most attention-grabbing parts of a natural scene, and a saliency detection model aims to find them automatically in an input image or video. Salient Object Detection is one of the hot research directions in computer vision: it segments the salient regions of images or videos and plays an increasingly important role in public security video surveillance, autonomous driving, video compression and object detection. In recent years, Fully Convolutional Neural Networks have become the mainstream approach to saliency detection, and the best results are obtained by training the network with a large amount of Ground Truth (GT) data. Although existing models have achieved good results, complex scenes still pose great challenges, which can be summarized as follows: (1) saliency models are complex and over-dependent on GT, which makes them hard to apply in real-time scenarios such as autonomous driving and the Internet of Things; (2) in video saliency detection, the semantic features of video frames and the motion features between frames are not well integrated; (3) the commonly used BCE loss function fails to take the spatiotemporal information between labels into account. Therefore, building on deep convolutional neural networks, this thesis studies how to reduce the dependence on GT and how to mine the relationship between motion and semantic features, and constructs a joint loss function (STSS). The main research results of this thesis are as follows:

(1) We propose a simple yet efficient architecture, termed Guidance and Teaching Network (GTNet), to independently distil effective spatial and temporal cues with implicit guidance and explicit teaching at the feature and decision level, respectively. Specifically, we (a) introduce a temporal modulator that implicitly bridges features from the motion branch into the appearance branch so that cross-modal features are fused collaboratively, and (b) use a motion-guided mask to propagate explicit cues during feature aggregation. This learning strategy achieves satisfactory results by decoupling the complex spatial-temporal cues and mapping informative cues across modalities. Compared with the TENet algorithm on three public datasets, Fβ (higher is better) increases by 0.4%, 0.1% and 4.4%, and Sm (higher is better) improves by 0.7%, 0.9% and 2.1%, respectively.
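The abstract names the temporal modulator (implicit guidance) and the motion-guided mask (explicit teaching) but does not spell out their form. The following is a minimal PyTorch sketch of how such a pair of modules could be wired; every module name, channel size and the sigmoid-gating formulation is an illustrative assumption rather than the thesis's actual GTNet implementation.

```python
# Sketch of motion-to-appearance guidance; all details are assumptions, not GTNet's code.
import torch
import torch.nn as nn


class TemporalModulator(nn.Module):
    """Implicit guidance: motion features re-weight appearance features."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # Motion produces a per-pixel gate; the residual keeps appearance detail intact.
        return appearance * self.gate(motion) + appearance


class MotionGuidedAggregation(nn.Module):
    """Explicit teaching: a coarse motion-derived mask gates feature aggregation."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_mask = nn.Conv2d(channels, 1, kernel_size=1)
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        mask = torch.sigmoid(self.to_mask(motion))                 # motion-guided mask
        gated = torch.cat([appearance, appearance * mask], dim=1)  # original + masked features
        return self.fuse(gated)


if __name__ == "__main__":
    app = torch.randn(2, 64, 56, 56)   # appearance-branch features
    mot = torch.randn(2, 64, 56, 56)   # motion-branch features (e.g. from optical flow)
    fused = TemporalModulator(64)(app, mot)
    out = MotionGuidedAggregation(64)(fused, mot)
    print(out.shape)                   # torch.Size([2, 64, 56, 56])
```

The point of this arrangement, as described above, is that motion cues never replace appearance features; they only re-weight and guide them, so the two modalities remain decoupled while still exchanging information.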
(2) We propose a video saliency detection model trained on synthetic data. We first use still images to synthesize large-scale video frames with associated pseudo-labels. Then, since the loss functions of existing video saliency detection models ignore the spatiotemporal information between labels, we propose a new joint loss function (STSS) to train our VSOD model on the synthesized data. In this model, a pretrained residual network (ResNet-50) first extracts high-level semantic features from the synthesized video frame sequence; the consecutive semantic features are then fed into a ConvGRU network to obtain spatial and temporal features; finally, an end-to-end deep learning network is built on the learned features. Compared with the fully supervised FCNS algorithm, MAE (lower is better) is reduced by 1.9%, 1.5% and 1.3% on the three public datasets, and Sm (higher is better) improves by 4.6%, 2.4% and 0.2%, respectively.
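The pipeline in result (2), per-frame ResNet-50 features fed into a ConvGRU and then into a saliency head, can be sketched as follows. The ConvGRU cell, hidden size, feature resolution and prediction head are assumptions for illustration, not the thesis's exact network.

```python
# Sketch of a ResNet-50 + ConvGRU video saliency model; details are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


class ConvGRUCell(nn.Module):
    """A standard convolutional GRU cell (assumed form of the ConvGRU mentioned above)."""

    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)  # update/reset
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)       # candidate state
        self.hid_ch = hid_ch

    def forward(self, x, h):
        if h is None:
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde


class VideoSaliencyNet(nn.Module):
    def __init__(self, hid_ch: int = 256):
        super().__init__()
        backbone = resnet50(weights=None)  # load ImageNet weights in practice
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, H/32, W/32)
        self.reduce = nn.Conv2d(2048, hid_ch, kernel_size=1)
        self.gru = ConvGRUCell(hid_ch, hid_ch)
        self.head = nn.Conv2d(hid_ch, 1, kernel_size=1)

    def forward(self, clip):                          # clip: (B, T, 3, H, W)
        b, t, c, h, w = clip.shape
        state, maps = None, []
        for i in range(t):
            feat = self.reduce(self.encoder(clip[:, i]))      # per-frame semantic features
            state = self.gru(feat, state)                     # temporal propagation
            sal = torch.sigmoid(self.head(state))
            maps.append(F.interpolate(sal, size=(h, w), mode="bilinear", align_corners=False))
        return torch.stack(maps, dim=1)               # (B, T, 1, H, W) saliency maps


if __name__ == "__main__":
    model = VideoSaliencyNet()
    clip = torch.randn(1, 4, 3, 224, 224)             # a 4-frame synthesized clip
    print(model(clip).shape)                           # torch.Size([1, 4, 1, 224, 224])
```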
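The abstract only states that STSS augments the usual per-frame supervision with a spatiotemporal term over the labels; its exact formulation is not given here. The sketch below therefore pairs standard BCE with an assumed temporal-consistency term purely to illustrate the idea of such a joint loss, and should not be read as the thesis's STSS definition.

```python
# Illustrative joint loss: per-frame BCE plus an assumed temporal-consistency term.
import torch
import torch.nn.functional as F


def joint_loss(pred: torch.Tensor, gt: torch.Tensor, temporal_weight: float = 0.5):
    """pred, gt: (B, T, 1, H, W) saliency predictions and (pseudo-)labels in [0, 1]."""
    # Per-frame binary cross-entropy, as in the commonly used baseline loss.
    bce = F.binary_cross_entropy(pred, gt)
    # Assumed spatiotemporal term: frame-to-frame changes in the prediction
    # should match frame-to-frame changes in the labels.
    pred_diff = pred[:, 1:] - pred[:, :-1]
    gt_diff = gt[:, 1:] - gt[:, :-1]
    temporal = F.l1_loss(pred_diff, gt_diff)
    return bce + temporal_weight * temporal
```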
Keywords/Search Tags: Image/Video Salient Object Detection, Visual Attention Mechanism, Fully Convolutional Neural Networks, Deep Learning