A rich stream of visual data, on the order of 10^8 bits per second, enters our eyes, which far exceeds what the brain can fully process and assimilate. Faced with overwhelming amounts of complex visual input, the human visual system can still effectively acquire the important information in a scene to build its understanding of the external world. The main reason for this is the selective visual attention mechanism. On the one hand, research on visual attention can reveal more of the internal working mechanisms of the human visual system; on the other hand, with the rapid development of the Internet and multimedia technology, visual attention modeling can provide a feasible solution to the analysis and processing of large-scale data. Research has suggested that in the perceptual procedure, the gap between reconstruction and actual input can guide visual saliency. Therefore, based on the hypothesis of center-surround contrast, we compute center-surround contrast from the perspective of reconstruction to study reconstruction-based bottom-up visual attention. The major contributions of this dissertation are as follows.

Firstly, we introduce a saliency estimation algorithm based on linear sparse reconstruction. Previous models are usually devoted to finding and selecting effective salient features. However, it is often difficult to find a set of features suitable for different scenes. To handle more diverse visual inputs, more and more types of features are integrated into such models to highlight saliency from different perspectives. By contrast, we start from the computation of center-surround contrast and formulate saliency as the reconstruction residual of the central patch with respect to nonlocal surrounding ones. Through the reconstruction residual, the identical components shared by the center and surround are removed, thereby adaptively highlighting their difference in distinct situations. In experiments on public data sets, as a purely data-driven model, the
algorithm can outperform models based on learning from human eye-movement data.

Secondly, we propose a deep autoencoder-based saliency estimation model. In existing reconstruction-based methods, the reconstruction parameters of each area are calculated independently, without taking their global correlation into account. Moreover, the reconstruction process is often linear, but research has shown that even at the earliest stages, the essential role of visual processing is still nonlinear computation; nonlinear models also have stronger representational ability to handle more types of scenes. To solve these problems, we first construct a deep autoencoder-based nonlinear network to form the inference from surround to center. Then, we sample from the entire image to train the center-surround reconstruction parameters of the current scene. Finally, the saliency of each point is defined as the reconstruction residual of the network. By fusing the competition in global sampling with the framework of reconstruction, the network is biased toward describing repeated redundant regions while highlighting the unusual parts. Experimental results demonstrate that, in accordance with different inputs, the network can learn distinct basic features for saliency modeling in its code layer. Furthermore, comprehensive evaluation on several benchmark data sets also shows the advantages of the model.

Thirdly, we propose a stereoscopic saliency estimation model based on background modeling. Research has implied that depth is one of the essential properties in the human visual system and contributes to visual attention. Based on these findings, stereoscopic saliency models have been proposed that exploit depth information to complement the saliency of two-dimensional cues. However, the usage of depth and its integration with other cues are usually simple. Motivated by observations in three-dimensional environments, we first regard depth as a prior to estimate each region's probability of being
background. Then, we sample pairs of patches from the probable background to train the autoencoder-based network and measure stereoscopic saliency according to the reconstruction residual of the network. With its emphasis on learning from the background, the model can better separate foreground from background and thus achieve a lower false positive rate. Experimental results demonstrate that the proposed method can outperform state-of-the-art fixation prediction algorithms on several public data sets for stereoscopic saliency estimation. Additionally, it can be effectively used for proto-object extraction.

Fourthly, we propose a deep autoencoder-based saccadic scanpath prediction model. Most existing studies of visual attention have focused on the prediction of static saliency maps. However, actual visual attention is a dynamic process, and hence modeling the saccadic scanpath formed by the shift of fixations is also indispensable, which has been neglected in previous investigations. Also based on the autoencoder reconstruction model, we constantly update the model parameters with data from the current fixation to model dynamic visual perception. In the model, a saccade can be explained as an iterative process of finding the most uncertain area and updating the representation of the scene. Compared with existing saccadic algorithms, the model produces results consistent with human saccadic scanpaths.

In conclusion, along the principal line of center-surround reconstruction, the dissertation proceeds from nonlocal linear sparse reconstruction to global nonlinear reconstruction, and then from depth-modulated background-based reconstruction to constantly updated dynamic reconstruction. In a unified framework, we simultaneously model static saliency and dynamic saccades, giving a more complete description and modeling of bottom-up, data-driven attention in the human visual system.
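The center-surround reconstruction residual that threads through all of the contributions can be sketched in a few lines. This is a minimal illustration, not the dissertation's implementation: ridge (L2-regularized) regression stands in for the sparse coding and the autoencoder networks described above, and the function name, dimensions, and the `lam` parameter are choices made purely for the example.

```python
import numpy as np

def reconstruction_saliency(surround, center, lam=0.1):
    """Saliency of one location as its center-surround reconstruction residual.

    surround : (d, n) array whose columns are flattened surrounding patches,
               used as a dictionary to reconstruct the central patch
    center   : (d,) flattened central patch

    Ridge regression is a simplified stand-in for the sparse (L1) coding in
    the dissertation; in both cases the residual cancels the components the
    center shares with its surround, so redundant patches score near zero.
    """
    D = surround
    # Closed-form solution of  min_a ||center - D a||^2 + lam ||a||^2
    gram = D.T @ D + lam * np.eye(D.shape[1])
    a = np.linalg.solve(gram, D.T @ center)
    residual = center - D @ a
    return float(np.linalg.norm(residual))

# A center identical to one of its surrounding patches is almost perfectly
# reconstructed (low saliency); a patch unlike its surround is not.
rng = np.random.default_rng(0)
surround = rng.normal(size=(16, 8))
redundant_center = surround[:, 0]
distinct_center = 5.0 * rng.normal(size=16)
print(reconstruction_saliency(surround, redundant_center))
print(reconstruction_saliency(surround, distinct_center))
```

In the dissertation the surrounding patches are gathered nonlocally across the image and the coefficients are sparse, but the principle is the same: whatever the surround can explain is subtracted out, and only the unexplained difference contributes to saliency.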
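The saccadic process of the fourth contribution, iteratively fixating the most uncertain area and then updating the model, can likewise be caricatured as a greedy loop over a saliency map. This is only a sketch of the control flow: the suppression step below is a crude stand-in for retraining the reconstruction network on data from the current fixation, and `predict_scanpath` with its parameters is a hypothetical name invented for this example.

```python
import numpy as np

def predict_scanpath(saliency, n_fixations=3, inhibition_radius=1):
    """Greedy scanpath sketch over a 2-D saliency (residual) map.

    Repeatedly fixate the highest-residual location, then suppress a small
    neighborhood around it; in the dissertation this suppression is instead
    achieved by updating the reconstruction model with the fixated data,
    which lowers the residual of similar regions everywhere in the scene.
    """
    s = np.array(saliency, dtype=float)
    path = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(s), s.shape)
        path.append((int(y), int(x)))
        # Inhibition of return: blank out the fixated neighborhood.
        y0, y1 = max(0, y - inhibition_radius), y + inhibition_radius + 1
        x0, x1 = max(0, x - inhibition_radius), x + inhibition_radius + 1
        s[y0:y1, x0:x1] = -np.inf
    return path

# Three peaks are visited in order of decreasing residual.
s = np.zeros((5, 5))
s[1, 1], s[3, 4], s[0, 4] = 3.0, 2.0, 1.0
print(predict_scanpath(s, n_fixations=3))  # [(1, 1), (3, 4), (0, 4)]
```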