| Visual object tracking is one of the most important research topics in computer vision, with a wide range of applications in video surveillance, autonomous driving, intelligent transportation, human-computer interaction, and military reconnaissance. Its task is to continuously estimate the position and scale of a target of interest in a video sequence, in support of high-level video analysis and understanding. Although object tracking algorithms have made considerable progress, complex real-world scenes involve illumination variation, occlusion, fast motion, deformation, scale variation, and background clutter, and these dynamic factors make high-precision, robust visual tracking a great challenge. In recent years, the rapid development of deep learning has carried over into object tracking. Benefiting from massive video data and continuing improvements in computing software and hardware, data-driven deep tracking algorithms show a significant performance advantage. Therefore, based on a detailed analysis of existing deep learning-based object tracking theories and methods, this thesis conducts in-depth research from four perspectives: effective representation learning, robust similarity matching, accurate bounding box prediction, and efficient motion estimation. The main research contents and innovations are as follows: 1) To address the insufficient representation power of shallow features and the high computational cost of deep features in deep learning-based tracking algorithms, a visual attention guided residual learning object tracker (CSART) is proposed. This method first studies the impact of visual features extracted by different backbone networks on tracking accuracy and efficiency. Second, self-attention modules are used to capture the long-range semantic dependencies of the basic features
in the spatial and channel dimensions, respectively, thereby obtaining two context-aware attention feature maps. Then, we apply the idea of residual learning to adaptively fuse them with the original features, so as to learn a more discriminative representation of foreground and background. In addition, a multi-task loss function is designed to jointly optimize the entire appearance model in an end-to-end manner while also alleviating the imbalance of training samples. Extensive experimental results indicate that the proposed tracking algorithm significantly improves the accuracy and robustness of the original deep learning-based model without reducing tracking speed. 2) Siamese network-based trackers generally use simple cross-correlation to match features between the template and the search region, but this fixed linear operation handles background noise poorly and limits discriminative ability. This work proposes a robust tracking framework based on Siamese relation networks, which exploits a relation network to model a non-linear, learnable similarity metric function. It can be readily integrated into existing Siamese tracking methods, enabling a coarse-to-fine two-stage matching process. During inference, an online tracking strategy based on dual template matching is proposed: the initial template is kept, and feedback from high-confidence tracking results is used to acquire and update a dynamic template, which further improves the robustness and online adaptability of the model. Experimental results show that the proposed tracker can effectively cope with problems such as target appearance change and background clutter. 3) To address the inconsistency between the classification and regression tasks and the inaccurate predicted boxes in most anchor-free trackers, an Uncertainty-Aware Siamese Tracker (UAST) is proposed. This method mainly studies the uncertainty and ambiguity representation of bounding boxes, exploiting the
regression vector to directly model a discrete probability distribution for each of the four offsets of the target box, and calculating the integral of each distribution to obtain a flexible and informative bounding box representation. It can also estimate a certainty score for each boundary based on the probabilities of the predicted value’s neighbors, thereby achieving uncertainty estimation in tracking. At the same time, considering the high correlation between uncertainty and regression accuracy, a joint classification-localization quality representation head is designed to resolve the misalignment between classification and regression. In addition, a dynamic label assignment strategy is developed for high-quality object tracking. Experimental results on five public tracking benchmarks show that UAST outperforms many state-of-the-art trackers and is more reliable for practical vision systems. 4) Considering the limitations of current classification-based deep trackers, such as inefficient search and time-consuming online fine-tuning or updating, a visual object tracking model based on hierarchical deep reinforcement learning is proposed. This work reformulates the target tracking problem, showing how to teach machines to imitate the human behavior paradigm (several dynamic iterative searches) to perform the tracking task. Specifically, we construct a feature observation network, a policy network, an actor-critic network, and a Long Short-Term Memory (LSTM) module that models the temporal information of target motion, and follow the definition of a Markov Decision Process (MDP) to learn hierarchical decisions about tracking modes and motion estimation via deep reinforcement learning algorithms. To improve the efficiency of online tracking, an expert tracker is also introduced to guide the model update and re-initialization process. Extensive experimental results demonstrate that the proposed algorithm achieves good tracking
performance under various challenging factors, and performs well in terms of robustness and accuracy. |
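
As an illustration of the distribution-based box representation described in contribution 3), the following is a minimal sketch and not the UAST implementation. It assumes each box offset is predicted as logits over a small set of discrete bins; the function names `softmax` and `decode_offset`, the bin layout, and the rule of summing the two bins that bracket the expected value as the certainty score are all hypothetical choices made for this example.

```python
import bisect
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_offset(logits, bin_values):
    """Decode one box offset from logits over discrete bins.

    Returns (expected_offset, certainty). `bin_values` must be sorted
    in ascending order. The certainty rule is an assumption made for
    illustration: the probability mass on the two bins that bracket
    the expected offset.
    """
    probs = softmax(logits)
    # Discrete counterpart of integrating the distribution: its expectation.
    expected = sum(p * v for p, v in zip(probs, bin_values))
    # Index of the last bin whose value does not exceed the expectation,
    # clamped to guard against floating-point rounding at the edges.
    lower = bisect.bisect_right(bin_values, expected) - 1
    lower = min(max(lower, 0), len(bin_values) - 1)
    upper = min(lower + 1, len(bin_values) - 1)
    certainty = probs[lower] + (probs[upper] if upper != lower else 0.0)
    return expected, certainty


# Hypothetical bins: offsets of 0..4 feature-map cells.
bins = [0.0, 1.0, 2.0, 3.0, 4.0]
sharp_offset, sharp_certainty = decode_offset([0.0, 0.0, 8.0, 0.0, 0.0], bins)
flat_offset, flat_certainty = decode_offset([1.0, 1.0, 1.0, 1.0, 1.0], bins)
```

Both example distributions decode to roughly the same offset, but the peaked one yields a much higher certainty score than the flat one, which is the kind of signal a tracker could use to judge how reliable a predicted boundary is.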