
Research On Visual Attention Model Based On Instance-level Competition

Posted on: 2022-01-29    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H Li    Full Text: PDF
GTID: 1488306602493644    Subject: Circuits and Systems
Abstract/Summary:
The human visual system can quickly locate the most attractive parts of a complex scene, an ability known as the selective visual attention mechanism. On one hand, the investigation of visual attention helps people understand the working mechanism of the human visual system. On the other hand, models of visual attention can be used to allocate limited computation and transmission resources to the informative parts of images and videos, improving the efficiency of image processing and computer vision tasks. Although existing visual attention models achieve good performance, a gap remains between them and human behavior. In particular, when a scene contains multiple objects, these models cannot accurately allocate attention to different instances of objects belonging to the same category. The main reason for this gap is that existing models are unable to represent each instance in the scene effectively. An object is, in general, an individual in the scene carrying semantic category information; an instance of an object (instance for short) is an individual with a more comprehensive and detailed semantic description built on those semantic categories. Studies have shown that humans allocate attention to each instance in a scene according to the differences between instances. Therefore, instance features of objects are introduced to provide a more effective representation for each instance. Instance features can distinguish different instances even when they share the same semantic label. In a static scene, the instance features should contain the intra-class, inter-class, and spatial contextual information of an instance; in a dynamic scene, they should contain the semantic, motion, and temporal contextual features of an instance. Constructing the attention competition relation among instance features enables the network to learn to allocate attention to different instances in multiple-object scenarios.
Deep convolutional neural networks are used to model image saliency, video saliency, and saccade scanpaths separately, all of which can handle instance-level attention competition. Specifically, the main contributions include the following aspects.
First, a saliency prediction model based on a densely connected convolutional neural network is proposed. In response to the problem that existing deep learning-based models cannot handle the attention distribution of multiple objects well, we propose a multi-scale dilated dense convolutional neural network to deal with the attention distribution problem in multiple-object scenes and better predict saliency. In the proposed network structure, the densely connected convolutional module encodes the inter- and intra-class features of objects so that they compete for instance-level attention. Second, dilated convolution enlarges the receptive field of neurons to collect contextual information and enrich the instance features of objects. Finally, a skip connection is introduced to provide multi-scale features for cross-scale attention competition, which helps deal with natural scenes containing objects at different scales. Experimental results on the SALICON, CAT2000, and MIT1003 datasets show that the proposed model achieves accurate allocation of attention in multiple-object scenarios and produces more accurate saliency predictions. Meanwhile, the model generalizes well.
Second, a video saliency prediction model based on a 3D convolutional encoder-decoder is proposed. In response to the problem that existing deep learning-based models do not dynamically allocate saliency to different instances with motion information well enough, we propose an asymmetric 3D fully convolutional encoder-decoder network to handle saliency allocation and attention shifts with motion information in multiple-object scenes. In the proposed network structure, the encoder consists of two subnetworks that extract
spatial and temporal features, respectively. At different stages of the encoder, features from the two domains are merged to form the spatio-temporal features of objects for instance-level attention competition. The decoder decodes the spatio-temporal features of objects in the spatial dimension and aggregates information in the temporal dimension to obtain temporal contextual information and construct the attention competition relationship among instance features. Specially designed structures transfer pooling indices from the encoder to the decoder, which helps generate location-aware saliency maps. The proposed model can be trained and run in an end-to-end manner. Experimental results on the benchmark dataset DHF1K show that the proposed model improves the prediction accuracy of saliency allocation and attention shifts in dynamic multiple-object scenarios.
Third, a visual saccade scanpath prediction model based on a convolutional encoder-decoder network with skip connections is proposed. In response to both the sparsity of predicted scanpath locations and the lack of large annotated datasets, we propose an unsupervised representation learning method based on convolutional encoder-decoder reconstruction that predicts the content of a central region from peripheral information. First, high-level semantic features of objects are learned by abstracting low-level local image features layer by layer through the proposed convolutional neural network. Contextual information about the objects is introduced at the decoder through skip connections, yielding instance feature representations of the objects. Second, for each image, under a unified representation learning mode, the input to the network contains the stimuli, or partial stimuli, of different instances, so that an overall competitive relationship can be constructed among the instance features of these stimuli. The network can then learn better reconstruction results based on
instance-level attention competition. The difference between the predicted and actual content of the central region is regarded as a perceptual residual, which reflects the visual system's perception of the image content and serves as a measure of saliency. Finally, the saccade path is predicted within the existing iterative representation learning framework. Experimental results on the MIT1003, Toronto, and OSIE datasets show that the proposed model improves prediction performance.
In summary, based on instance-level attention competition, these studies span visual attention prediction in static scenes, visual attention in dynamic scenes, and the prediction of the dynamic process of visual attention. Convolutional neural network architectures are used to obtain instance feature representations of different objects and to model the static image saliency, dynamic video saliency, and dynamic saccade processes of visual attention, comprehensively studying the whole process of human visual attention.
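The dense, dilated convolutional design described in the first contribution can be illustrated with a short sketch. The PyTorch code below is a toy, not the dissertation's actual network: the class names, layer counts, channel widths, and dilation rates (1, 2, 4) are all assumptions made for illustration. It shows how dense connectivity concatenates all earlier feature maps while increasing dilation widens the receptive field, so each layer sees progressively richer contextual features.

```python
import torch
import torch.nn as nn

class DilatedDenseBlock(nn.Module):
    """Densely connected convolutions with growing dilation rates.

    Each layer receives the concatenation of the block input and all
    previous layer outputs (dense connectivity), and uses dilation to
    enlarge its receptive field and gather spatial context.
    Hyperparameters here are illustrative only.
    """

    def __init__(self, in_channels=64, growth=32, dilations=(1, 2, 4)):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for d in dilations:
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(growth),
                nn.ReLU(inplace=True),
            ))
            channels += growth  # dense concatenation grows the next input

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

class SaliencySketch(nn.Module):
    """Tiny stem + dense block + 1x1 read-out producing a saliency map."""

    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.block = DilatedDenseBlock(in_channels=64, growth=32)
        self.head = nn.Conv2d(64 + 3 * 32, 1, kernel_size=1)

    def forward(self, x):
        # Sigmoid maps logits to per-pixel saliency values in [0, 1].
        return torch.sigmoid(self.head(self.block(self.stem(x))))

model = SaliencySketch()
saliency = model(torch.randn(1, 3, 96, 96))
print(saliency.shape)  # torch.Size([1, 1, 96, 96])
```

Because every later layer sees the concatenation of all earlier feature maps, the block mixes low- and high-level features in the way the abstract credits with encoding inter- and intra-class object information, while the growing dilation rates collect wider context without adding parameters.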
Keywords/Search Tags: Visual Attention, Saliency, Visual Saccade, Instance Features, Attention Competition, Convolutional Neural Network