Saliency detection aims at simulating the human visual attention mechanism to detect the most salient, or distinct, regions or objects in a visual scene. It is one of the most fundamental image pre-processing techniques and can benefit many high-level computer vision tasks. From the perspective of the research target, saliency detection can be divided into two branches: eye fixation prediction and salient object detection. The former only requires highlighting the most salient regions in an image, while the latter takes objects as research targets and requires both accurately localizing and precisely segmenting the most salient objects.

Traditional methods usually extract simple features such as color and texture as image representations, then resort to the contrast mechanism to model saliency and combine top-down semantic cues to predict eye fixations. For salient object detection, researchers usually combine contrast with prior knowledge, such as the boundary prior and the objectness prior, to localize salient objects. Segmentation-based techniques can then be adopted to refine the saliency maps. However, all of these models are hand-designed and thus highly limited by our insufficient understanding of the human visual attention mechanism. As a result, traditional saliency models are generally limited in performance and generalization ability.

To solve the problems mentioned above, in this thesis we propose to leverage powerful deep neural networks (DNNs) for saliency detection. Benefitting from end-to-end training with large-scale saliency data, DNNs can automatically learn optimal saliency modeling elements in a data-driven manner, including elements that have not been noticed or considered by previous traditional methods. Consequently, DNN-based saliency models can achieve superior performance and generalize well to various visual scenes. Nevertheless, applying DNNs to the saliency detection problem is non-trivial, and there are still many directions that need to be
explored. In this thesis, we present four works that study different aspects of this technique:

1. We first propose a multiresolution convolutional neural network (Mr-CNN) for eye fixation prediction, which is one of the earliest efforts in deep saliency modeling. It integrates low-level and high-level feature learning, contrast inference, and saliency prediction in a unified network. Specifically, Mr-CNN is trained and tested on each image pixel and directly outputs its saliency value. It is composed of three CNN branches, which respectively model the center region, the surround region, and the whole image for each pixel. As such, low-level and high-level features can be learned in shallow and deep layers, and local and global contrast can be inferred by combining the multiresolution region features. Finally, saliency values are predicted by integrating all of these saliency cues. Experimental results show that the proposed model significantly outperforms traditional methods and early deep saliency models.

2. We further explore explicitly modeling global context and scene modulation for eye fixation prediction. Global context is widely recognized as a key element in saliency detection and is typically reflected in global contrast modeling, whereas scene modulation has also been demonstrated to influence saliency perception but is seldom studied. We propose a long short-term memory (LSTM) based model that explicitly propagates global context and scene information to each pixel, leading to a new state-of-the-art deep saliency model. Ablation studies also demonstrate the effectiveness of our key idea.

3. For salient object detection, we present an end-to-end deep model that is both effective and efficient. Unlike early deep models that independently process image oversegmentations with sliding-window schemes, we directly feed the whole image into the network and first predict a coarse saliency map. We then leverage the intermediate CNN features to hierarchically
refine the saliency map, incorporating finer and finer low-level features. In this way, we generate saliency maps from the global view to local contexts, and from coarse to fine scales. The final model outperforms previous models by a large margin in terms of both accuracy and speed.

4. Finally, we explore a novel idea of learning pixel-wise contextual attention for salient object detection. Unlike previous methods that model contextual information holistically, we propose to find the useful context regions for each pixel by learning an attention map. We then use the learned attention weights to aggregate only the informative contextual features and filter out useless ones in pooling and convolution operations. By further considering both global and local contexts, we present three forms of attention modules and embed them in a UNet architecture to detect salient objects. Experimental results demonstrate that global and local attention learn global contrast and region smoothness, respectively, and can thus promote saliency detection performance. As a result, our saliency model outperforms all other state-of-the-art algorithms. We also validate the generalization ability of the proposed attention models on other vision tasks.
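The pixel-wise attentive pooling described in the fourth work can be sketched in a few lines. The following is a minimal illustrative NumPy sketch, not the thesis implementation: for one query pixel, a vector of learned attention logits over its context locations is normalized with a softmax and used to take a weighted sum of the context features, so informative locations dominate and useless ones are suppressed. All names, shapes, and the random inputs here are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pool(features, attention_logits):
    """Aggregate context features for one query pixel with learned attention.

    features:         (N, C) array, one C-dim feature per context location
    attention_logits: (N,)   one learned score per context location
    Returns a (C,) vector: the attention-weighted sum of context features.
    """
    weights = softmax(attention_logits)  # non-negative, sums to 1
    return weights @ features            # (C,) weighted aggregation

# Toy example: a 4x4 context window (16 locations) with 4-channel features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 4))
logits = rng.standard_normal(16)
pooled = attentive_pool(feats, logits)
print(pooled.shape)  # (4,)
```

In the full model this would be applied at every pixel (with a separate attention map per pixel) and the logits would be produced by a learned sub-network over either the global or a local context, which is what distinguishes the global and local attention forms.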