| Multiple object tracking is an essential and difficult task in computer vision. In recent years, it has attracted considerable research attention because of its application value and potential in fields such as intelligent surveillance and autonomous driving. Currently, deep learning-based trackers can be broadly divided into two-stage methods and one-shot methods. The latter have become a popular research topic due to their lower computational cost and faster tracking speed. However, one-shot trackers are usually less robust than two-stage methods and often suffer from serious issues such as bounding box drift and identity switches in complex scenarios. Designing one-shot trackers that balance tracking accuracy and speed therefore remains a challenging research problem. This thesis analyzes the shortcomings of existing networks and the causes of tracking failures, and proposes improvements to enhance tracking accuracy and speed. The main contributions of this thesis are summarized as follows: (1) To alleviate the optimization conflict within one-shot networks, this thesis designs a novel tracker, FPUAV, that learns task-specific features. It analyzes the essential differences and inherent conflicts between the subtasks in one-shot networks and proposes two targeted sub-networks. First, a novel feature decoupling network based on self-attention and cross-attention is proposed to learn features that satisfy the representation required by each task. Then, a pyramid Transformer encoder, built on multi-scale learning and the Transformer, is designed to predict scale-aware fine-grained features, in order to enhance the tracker's ability to locate targets in complex scenarios. (2) To address the difficulty existing trackers have in maintaining target identity information over long periods, this thesis designs a dedicated feature extraction branch and proposes
GCEVT. The feature extraction branch consists of two sub-modules: the pyramid fusion network and the channel-wise Transformer enhancer. The former alleviates the semantic misalignment caused by changes in target scale by producing features that carry both low-level and high-level information; to do so, it captures pixel-level long-range dependencies among features at different scales. The latter models the channel attention mechanism as a self-attention structure so that features interact across channels using pixel information. By capturing features with global semantic information, the proposed GCEVT effectively boosts tracking robustness. (3) The proposed FPUAV and GCEVT are evaluated on several datasets. Their tracking performance is measured under various challenging scenarios for pedestrian tracking and vehicle tracking tasks. Extensive experimental results demonstrate the effectiveness and high accuracy of FPUAV and GCEVT. Compared to the baseline tracker, FPUAV improves the accuracy metric MOTA and the robustness metric IDF1 by 6.7% and 5.6%, respectively. GCEVT further improves on the accuracy and robustness of FPUAV by 4.4% and 9.9%, respectively. Qualitative and quantitative experiments on multiple benchmarks show that the proposed methods outperform existing state-of-the-art trackers. |
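
For readers unfamiliar with the two metrics cited above, the following minimal sketch shows how MOTA and IDF1 are conventionally computed from aggregate counts. It uses the standard definitions (MOTA from the CLEAR MOT metrics, IDF1 from identity-level matching) and is not taken from the thesis's evaluation code. The function names and the example counts are hypothetical and chosen purely for illustration.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """Standard MOTA: 1 - (FN + FP + IDSW) / total ground-truth boxes.

    All counts are summed over every frame of the sequence.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Standard IDF1: 2*IDTP / (2*IDTP + IDFP + IDFN).

    IDTP, IDFP and IDFN are identity-level true positives, false
    positives and false negatives under the best one-to-one matching
    between predicted and ground-truth trajectories.
    """
    denominator = 2 * idtp + idfp + idfn
    return 2 * idtp / denominator if denominator else 0.0


if __name__ == "__main__":
    # Illustrative counts only; these are NOT results from the thesis.
    print(f"MOTA = {mota(10, 5, 5, 100):.3f}")
    print(f"IDF1 = {idf1(80, 10, 30):.3f}")
```

MOTA penalizes detection errors and identity switches jointly, so it mainly reflects detection accuracy, whereas IDF1 rewards trajectories that keep the same identity over time, which is why the abstract treats it as a robustness measure.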