
Learning Visual Attention And Robust Deep Feature For Object Detection And Tracking

Posted on: 2020-04-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Wang
Full Text: PDF
GTID: 1368330602457343
Subject: Computer application technology
Abstract/Summary:
Object detection and tracking are fundamental tasks in computer vision and key techniques in intelligent video surveillance systems. With the help of deep learning, these domains have advanced remarkably. However, object detection and tracking remain challenging due to the complexity of data, scenes, and environments. This paper targets these complex factors from the perspective of visual attention and robust deep feature learning. Specifically, the research can be divided into the following parts: adaptive weighted multi-modal salient object detection, target-driven visual attention generation for visual tracking, hard positive generation for visual tracking, tracking by natural language specification, and hard person identity mining for cross-camera tracking.

Firstly, we propose an adaptive multi-modal information fusion mechanism. For deep learning-based saliency detection, we propose a quality-aware multi-modal salient object detection framework based on deep reinforcement learning, which casts the adaptive weighting of different modal data as a decision-making problem. The proposed algorithm is validated on two kinds of dual-modal saliency detection benchmarks.

Secondly, we propose to jointly utilize global and local candidate samples to handle issues in the current tracking-by-detection framework, such as heavy occlusion, scale variation, and reappearance. Specifically, we achieve global proposal generation via target-driven visual attention maps. To better capture motion information, we use a 3D CNN to extract features from several consecutive video frames; meanwhile, we obtain features of the target object with a 2D CNN. These two features are concatenated and fed into an up-sampling network, which is trained with mean squared error and adversarial loss functions. The training data can be obtained from existing tracking datasets without any additional annotation. We first obtain rectangular regions according to the saliency regions and conduct Gaussian sampling. In the tracking procedure, the global and local proposals are all fed into the classifier, and the proposal with the maximum score is chosen as the result for the current video frame. A short- and long-term update strategy is adopted to update the model. Extensive experiments on multiple tracking benchmarks validate the effectiveness of the proposed algorithm.

Thirdly, few-sample learning is another key issue in visual tracking. Deep networks only work well when trained with large-scale data, so there is a gap between the few-sample visual tracking task and data-hungry deep neural networks, which may limit tracking performance. Besides, the shortage of hard samples in practical training datasets also makes trackers less robust to challenging factors. To handle these issues, this paper proposes to actively generate massive hard samples to bridge this gap. Specifically, it constructs the manifold of the target object with a variational auto-encoder and then decodes massive positive target object images. Meanwhile, we also use a background patch to occlude the target object, making the tracker more robust to occlusion. Massive hard training samples can be obtained via these techniques, which makes the baseline tracker perform better.

Fourthly, the currently most popular setting of visual tracking is initialized with one bounding box representing the target object in color images. However, the appearance model alone is not enough in a practical tracking procedure, especially when facing complex backgrounds, fast motion, etc. In this paper, we take the structural information among training samples into consideration with graph convolutional networks and introduce natural language specification for more robust deep feature learning. We also adopt an encoder-decoder framework to generate a global attention map based on the natural language and the target object patch to deal with reappearance, fast motion, heavy occlusion, etc. Our experiments validate that tracking performance can be enhanced significantly with the guidance of natural language.

Fifthly, for person tracking under the cross-camera scenario, one regular pipeline is to use a triplet loss for feature learning and compare the differences between human images in the feature space. Such methods adopt local mini-batch construction and ignore the correlation between the average feature of a person identity and each image feature of that identity, which may limit their final recognition results. This paper first estimates the attributes of each image in the person re-identification dataset; the attribute distance between different pedestrians can then be measured for global mini-batch construction. In the training phase, this paper considers the correlation between the average feature and each image feature of the same person and uses it as an additional criterion to optimize the neural network, adding this regularization term to the triplet loss function. Extensive experiments on both pedestrian attribute recognition and person re-identification datasets validate the effectiveness of this algorithm.
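The regularized triplet objective described in the fifth part can be sketched as follows. This is a minimal NumPy illustration, not the thesis code: Euclidean distances, the function name, the weight `lam`, and the margin value are all illustrative assumptions.

```python
import numpy as np

def triplet_with_center_reg(anchor, positive, negative, id_features,
                            margin=0.3, lam=0.1):
    """Hypothetical sketch: standard triplet loss plus a regularizer that
    pulls each image feature toward the average feature of its identity."""
    d_ap = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)   # anchor-negative distance
    triplet = max(0.0, d_ap - d_an + margin)   # hinge-style triplet term
    center = id_features.mean(axis=0)          # average feature of the identity
    reg = np.linalg.norm(anchor - center) ** 2 # pull anchor toward its center
    return triplet + lam * reg
```

In this reading, `id_features` would hold all image features sharing the anchor's identity within the globally constructed mini-batch, and `lam` balances the triplet term against the center-alignment regularizer.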
Keywords/Search Tags:Visual Attention, Deep Feature Learning, Salient Object Detection, Visual Tracking, Multi-modal Fusion