
Research On Visual Attention Detection And Salient Object Segmentation

Posted on: 2019-03-06    Degree: Doctor    Type: Dissertation
Country: China    Candidate: W G Wang    Full Text: PDF
GTID: 1488306470493584    Subject: Computer Science and Technology
Abstract
The human visual system has an astonishing ability to quickly select visually important regions in its field of view. This selective process enables humans to interpret complex scenes in real time with ease. Object segmentation, a fundamental task in computer vision, serves as a bridge between low-level visual processing and high-level scene understanding. Although many algorithms have been proposed for extracting objects from the background, the mechanisms underlying object segmentation, especially in unsupervised models, remain largely unexplored. In this thesis, we study the visual attention mechanism and explore how to use visual saliency as prior knowledge for guiding object segmentation. Specifically, this thesis develops a series of data-driven models for visual saliency detection and saliency-guided object segmentation, summarized as follows:

1. We build a novel neural network, the Attentive Saliency Network (ASNet), that learns to detect salient objects from fixation maps. The fixation map, derived at the upper network layers, captures a high-level understanding of the scene. Salient object detection is then viewed as fine-grained, object-level saliency detection and is progressively optimized under the guidance of the fixation map in a top-down manner. ASNet is built on a hierarchy of convolutional LSTMs (convLSTMs) that offers an efficient recurrent mechanism for sequential refinement of the segmentation map (a schematic sketch of this refinement loop follows item 2 below). Several loss functions are introduced to boost the performance of ASNet. Extensive experimental evaluation shows that the proposed ASNet generates accurate segmentation maps with the help of the computed fixation map. Our work offers deeper insight into the mechanisms of visual attention and narrows the gap between salient object detection and human fixation prediction.

2. We present a geodesic-distance-based technique that provides reliable and temporally consistent saliency measurements of superpixels as a prior for pixel-wise labeling (also sketched below). Using undirected intra-frame and inter-frame graphs constructed from spatiotemporal appearance and motion edges, together with a skeleton abstraction step that further enhances the saliency estimates, our method formulates pixel-wise segmentation as an energy minimization problem whose objective combines unary terms from global foreground and background models and dynamic location models with pairwise label-smoothness potentials. We perform extensive quantitative and qualitative experiments on benchmark datasets. Our method achieves superior performance compared with the current state of the art in terms of both accuracy and speed.
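As a concrete illustration of the recurrent refinement in contribution 1, below is a minimal PyTorch sketch of a fixation-guided convLSTM loop. It is not the authors' implementation: ASNet uses a deep backbone and a hierarchy of such cells across layers, whereas this toy version uses a single cell, and all names (ConvLSTMCell, FixationGuidedRefiner) are hypothetical.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    # One convolutional LSTM cell: all four gates come from a single
    # convolution over the concatenated input and hidden state.
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = self.conv(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()
        h = o.sigmoid() * c.tanh()
        return h, c

class FixationGuidedRefiner(nn.Module):
    # Predicts a coarse fixation map from high-level features, then
    # recurrently refines an object-level saliency map conditioned on it.
    def __init__(self, feat_ch=64, steps=3):
        super().__init__()
        self.steps = steps
        self.fix_head = nn.Conv2d(feat_ch, 1, 1)        # fixation prediction
        self.cell = ConvLSTMCell(feat_ch + 1, feat_ch)  # features + fixation prior
        self.sal_head = nn.Conv2d(feat_ch, 1, 1)        # saliency readout

    def forward(self, feat):
        fix = torch.sigmoid(self.fix_head(feat))        # high-level attention cue
        h = torch.zeros_like(feat)
        c = torch.zeros_like(feat)
        sal_maps = []
        for _ in range(self.steps):                     # sequential top-down refinement
            h, c = self.cell(torch.cat([feat, fix], dim=1), h, c)
            sal_maps.append(torch.sigmoid(self.sal_head(h)))
        return fix, sal_maps                            # losses can attach to every step

feat = torch.randn(2, 64, 28, 28)                       # stand-in backbone features
fix, sal_maps = FixationGuidedRefiner()(feat)
print(fix.shape, sal_maps[-1].shape)                    # both (2, 1, 28, 28)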
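For the geodesic prior in contribution 2, the following toy sketch scores superpixels by their shortest-path distance to background seeds on a weighted graph. It is a simplification under stated assumptions: the thesis combines intra- and inter-frame appearance and motion edges and adds a skeleton abstraction step, while here SciPy's dijkstra routine stands in for the geodesic computation and the features are arbitrary.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_saliency(features, edges, boundary_ids):
    # features: (N, d) per-superpixel descriptors (e.g., mean color + flow);
    # edges: (i, j) adjacency pairs; boundary_ids: superpixels touching the
    # frame border, treated as probable background seeds.
    n = len(features)
    rows, cols, w = [], [], []
    for i, j in edges:
        d = np.linalg.norm(features[i] - features[j])   # edge weight = feature distance
        rows += [i, j]; cols += [j, i]; w += [d, d]
    graph = csr_matrix((w, (rows, cols)), shape=(n, n))
    # Geodesic (shortest-path) distance from each node to its nearest
    # background seed; a large distance suggests salient foreground.
    dist = dijkstra(graph, directed=False, indices=boundary_ids).min(axis=0)
    dist[np.isinf(dist)] = dist[np.isfinite(dist)].max()  # disconnected nodes
    return dist / (dist.max() + 1e-8)                   # normalize to [0, 1]

feats = np.random.rand(6, 3)                            # 6 dummy superpixels
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]
print(geodesic_saliency(feats, edges, boundary_ids=[0, 5]))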
3. We propose a deep learning model to efficiently detect salient regions in videos. It addresses two important issues: (1) training a deep video saliency model in the absence of sufficiently large, pixel-wise annotated video data; and (2) fast video saliency training and detection. The proposed deep video saliency network consists of two modules that capture spatial and temporal saliency information, respectively. The dynamic saliency module, which explicitly incorporates saliency estimates from the static saliency module, directly produces spatiotemporal saliency inference without time-consuming optical flow computation (see the first sketch below). We further propose a novel data augmentation technique that simulates video training data from existing annotated image datasets, enabling our network to learn diverse saliency cues and preventing overfitting given the limited number of training videos. Leveraging our synthetic video data (150K video sequences) and real videos, our deep video saliency model successfully learns both spatial and temporal saliency cues, producing accurate spatiotemporal saliency estimates at improved speed.

4. We contribute to video saliency research in two ways. First, we introduce a new benchmark for predicting human eye movements during free viewing of dynamic scenes, something this field has long needed. Our dataset, named DHF1K (Dynamic Human Fixation 1K), consists of 1K high-quality, carefully selected video sequences annotated by 17 observers using eye-tracking equipment. The videos span a wide range of scenes, motions, object types, and background complexities. Existing video saliency datasets lack the variety and generality of common dynamic scenes and fall short in covering challenging situations in unconstrained environments. In contrast, DHF1K makes a significant leap in scalability, diversity, and difficulty, and is expected to advance video saliency modeling. Second, we propose a novel video saliency model that augments a CNN-LSTM network architecture with an attention mechanism to enable fast, end-to-end saliency learning (see the second sketch below). The attention mechanism explicitly encodes static saliency information, allowing the LSTM to focus on learning more flexible temporal saliency representations across successive frames. This design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. We thoroughly examine the performance of our model against state-of-the-art saliency models on three large-scale datasets (DHF1K, Hollywood-2, and UCF Sports). Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that our model outperforms its competitors and runs fast (40 fps on one GPU, including all steps).
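A minimal sketch of the two-module design in contribution 3, assuming tiny stand-in convolutional stacks rather than the full backbones of the actual model: the dynamic module receives a consecutive frame pair plus the static module's estimate and emits spatiotemporal saliency directly, with no optical flow. All names and layer sizes here are illustrative.

import torch
import torch.nn as nn

class StaticModule(nn.Module):                  # per-frame spatial saliency
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid())

    def forward(self, frame):
        return self.net(frame)

class DynamicModule(nn.Module):
    # Consumes a consecutive frame pair plus the static saliency estimate
    # and outputs spatiotemporal saliency directly -- no optical flow.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3 + 1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid())

    def forward(self, prev_frame, frame, static_sal):
        x = torch.cat([prev_frame, frame, static_sal], dim=1)
        return self.net(x)

static_net, dynamic_net = StaticModule(), DynamicModule()
prev_f, cur_f = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
s = static_net(cur_f)                           # spatial saliency of current frame
st = dynamic_net(prev_f, cur_f, s)              # spatiotemporal saliency
print(st.shape)                                 # torch.Size([1, 1, 64, 64])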
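And a sketch of the attention-augmented CNN-LSTM from contribution 4: a static attention branch, trainable on image fixation data alone, gates each frame's features before the convLSTM step, so the recurrent part only has to model temporal dynamics. Class names and sizes are again illustrative, not the published model.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):                  # same minimal cell as in the first sketch
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = self.conv(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()
        h = o.sigmoid() * c.tanh()
        return h, c

class AttentiveCNNLSTM(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.ch = ch
        self.cnn = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.attention = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())
        self.cell = ConvLSTMCell(ch, ch)
        self.head = nn.Conv2d(ch, 1, 1)

    def forward(self, clip):                    # clip: (B, T, 3, H, W)
        B, T, _, H, W = clip.shape
        h = clip.new_zeros(B, self.ch, H, W)
        c = clip.new_zeros(B, self.ch, H, W)
        outs = []
        for t in range(T):
            feat = self.cnn(clip[:, t])
            att = self.attention(feat)          # static saliency prior
            h, c = self.cell(feat * att, h, c)  # LSTM sees attended features over time
            outs.append(torch.sigmoid(self.head(h)))
        return torch.stack(outs, dim=1)         # (B, T, 1, H, W)

pred = AttentiveCNNLSTM()(torch.randn(1, 5, 3, 32, 32))
print(pred.shape)                               # torch.Size([1, 5, 1, 32, 32])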
Keywords: visual attention mechanism, human fixation prediction, salient object detection, object segmentation