| Object tracking has found extensive applications in fields such as autonomous driving, intelligent monitoring, human-computer interaction, and augmented reality. Visual object tracking faces numerous challenges, including object deformation and occlusion during motion, illumination variation, and background distractors, all of which make accurate tracking of target objects difficult. To address these challenges, researchers have applied deep learning techniques to object tracking, with deep Siamese networks garnering considerable attention. Deep Siamese networks model tracking as a similarity measurement task and, through large-scale offline learning, effectively overcome the limitations of traditional tracking algorithms such as correlation filtering in handling complex situations and real-time demands. This paper presents a comprehensive investigation of the application of deep Siamese networks to visual object tracking and of the developmental history of deep Siamese trackers. Although deep Siamese trackers can currently achieve precise tracking in simple scenarios, the accuracy and robustness of existing tracking algorithms require improvement in challenging situations involving rapid motion, extended occlusion, and similar objects. In response to these challenges, this paper makes the following contributions: 1. To address the lack of training data and reliable evaluation datasets for deep trackers, this paper proposes LaSOT, a large-scale, high-quality single-object tracking dataset. The dataset contains 1,400 sequences totaling over 3.5 million frames, with an average length of more than 2,500 frames per sequence, and presents rich challenges. The goal of LaSOT is to provide a high-quality benchmark for both training and evaluating deep learning-based trackers. The LaSOT dataset has become a standard
benchmark and training dataset in the tracking community, effectively promoting the development of deep trackers. 2. Transformers have greatly advanced object tracking through their ability to model long-distance dependencies, but existing work mainly focuses on integrating and enhancing features generated by convolutional neural networks (CNNs). The potential of Transformers in representation learning has not been fully exploited. This paper proposes SwinTrack, a fully attention-based tracker built on the Swin Transformer. SwinTrack uses Transformer structures for both representation learning and feature fusion, achieving better feature interaction than pure CNN or hybrid CNN-Transformer frameworks. SwinTrack also introduces a lightweight motion token that provides temporal context to improve tracking robustness. Experimental results show that SwinTrack outperforms existing methods on multiple tracking benchmarks. 3. With the recent proposal of empirical scaling laws for large models, many research efforts focus on expanding model and training data scales, and large-scale pre-trained visual models have achieved significant performance improvements. However, the gap between the scale of state-of-the-art models and the scale most researchers can afford is constantly widening. To address this issue, this paper proposes FVTrack, a low-training-cost tracker based on ViT, which achieves ultra-real-time tracking efficiency and can be trained within several GPU hours. To make the ViT model suitable for object tracking, this study makes several structural improvements and optimizations and introduces a token fusion strategy. Experimental results show that, while maintaining high efficiency, FVTrack's performance also reaches the current state of the art. Finally, this paper offers a glimpse into the future of object tracking. In future research, I will continue to devote myself to addressing the challenges of object tracking and focus on the
following aspects: leveraging the strong discrimination and generalization capabilities of large models to address the model drift traditionally introduced by temporal appearance representation learning; and combining tracking with language models, using their reasoning ability and world models, to address challenging issues in complex real-world scenarios. |