| Visual object tracking, a research hotspot in computer vision, has attracted the attention of many researchers in recent years. Object tracking is highly important for applications such as autonomous driving, security monitoring, intelligent targeted medical care, and UAV reconnaissance. Given the target state in the initial frame, visual object tracking aims to accurately estimate the target’s position and size in subsequent video frames. Because of challenging factors such as deformation, occlusion, background clutter, and motion blur, the subsequent target state may differ from the initial target state. Improving tracking accuracy under these challenging factors is therefore a key issue. With the large-scale application of the Transformer in natural language processing and computer vision, it has also shown great research potential in object tracking. However, the deployment and application of Transformer-based tracking algorithms in object tracking systems are still at a preliminary research stage. This thesis therefore addresses existing problems in two Transformer-based visual object tracking algorithms, STARK-S50 and TrDiMP, and improves on them. The main research work is as follows: (1) This thesis presents a visual object tracking algorithm based on a convolutional squeezed pyramid Transformer (CSPT). Building on the Transformer tracking framework, the algorithm draws on the traditional feature pyramid network idea and integrates it with the Transformer to fuse multi-resolution features and obtain multi-scale high-level semantic features containing both local and global spatial information. The CSPT network mines the scale information in the contextual abstract features of the target at different levels and calculates the multi-scale semantic information through the attention mechanism by combining the global 
dependency of the Transformer. It then calculates the association between the multi-scale semantic information through the attention mechanism and constructs a globally dependent cross-scale high-level semantic feature map. Compared with the baseline algorithm STARK-S50, the proposed algorithm takes fuller account of the association between coarse-grained and fine-grained features and global contextual semantic information, and further improves the model’s ability to express multi-scale semantic information. Furthermore, this thesis proposes a multi-domain spatial channel attention module that aggregates the output of both the Transformer encoder and decoder, thus further weakening the feature representation of non-target regions along the spatial-domain and channel-domain dimensions. Compared with the single-domain attention mechanism used in STARK-S50, the proposed algorithm takes into account the differences in feature information between the spatial domain and the channel domain, thus improving the model’s ability to filter important information and make full use of the feature information. With these improvements, this thesis improves the discriminative ability of the baseline tracking algorithm and achieves more robust and accurate tracking results. (2) This thesis proposes a Siamese tracking algorithm based on the Transformer and an anchor-free network. Because the baseline algorithm TrDiMP extracts features directly through the backbone network for subsequent processing, the epistemic model constructed in this way does not fully use feature information containing multi-scale semantics, which weakens the robustness of the tracking process. In this thesis, multi-level features are fused to build a robust and accurate feature model and enhance the feature representation capability. Furthermore, the baseline algorithm uses a lightweight 
Transformer network to build the information integration component, which limits the feature interaction and learning capability of the two branches. Therefore, in this thesis, the Transformer structure is improved and used as the key component for measuring the similarity of the Siamese branches: the encoder learns strong contextual information from the template image features through a self-attention mechanism and sends this information to the decoder, which performs cross-attention with the search region. In addition, TrDiMP uses an anchor-based search strategy that requires anchors to be generated from prior knowledge, which leads to inefficient training, poor generalization, and a sensitivity to hyperparameters that cannot be alleviated. Therefore, this thesis introduces a search strategy based on an anchor-free network, which consists of two sub-networks, classification and regression. The classification sub-network adopts a centrality-based auxiliary branch to assist the classification task, thus strengthening the sampling points closer to the target center and weakening the edge sampling points; the regression sub-network adopts an anchor-free design that directly obtains the predicted bounding box by regressing the four distances from the target centroid to the bounding box. Finally, to better capture changes in target appearance and background information, this thesis additionally proposes an online update mechanism based on a hybrid pooling strategy. With these improvements, this thesis improves the performance of the baseline tracking algorithm. The proposed object tracking algorithms are verified through a large number of comparison tests and ablation experiments on several benchmark datasets. The experimental results show that the proposed object tracking methods achieve excellent performance. |
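
The anchor-free search strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes an FCOS-style formulation in which each sampling point predicts four side distances and a centrality weight that down-weights points near the box edge; the abstract does not confirm this exact formula. The function names `decode_box`, `centrality`, and `weighted_score` are hypothetical.

```python
import math


def decode_box(px, py, left, top, right, bottom):
    """Convert four side distances, measured from the sampling point
    (px, py), into corner coordinates (x1, y1, x2, y2)."""
    return (px - left, py - top, px + right, py + bottom)


def centrality(left, top, right, bottom):
    """Assumed FCOS-style centrality weight: 1.0 when the point sits at
    the box center, approaching 0.0 as it nears any edge."""
    if max(left, right) <= 0 or max(top, bottom) <= 0:
        return 0.0  # degenerate box: give the point no weight
    horizontal = min(left, right) / max(left, right)
    vertical = min(top, bottom) / max(top, bottom)
    return math.sqrt(max(horizontal, 0.0) * max(vertical, 0.0))


def weighted_score(cls_score, left, top, right, bottom):
    """Scale a classification score by centrality so that points near the
    target center are strengthened and edge points are weakened."""
    return cls_score * centrality(left, top, right, bottom)


# A sampling point at (50, 40) with predicted distances to the four sides.
box = decode_box(50, 40, left=10, top=5, right=20, bottom=15)
center_score = weighted_score(0.9, 10, 10, 10, 10)  # point at the box center
edge_score = weighted_score(0.9, 1, 10, 19, 10)     # point near the left edge
```

With equal classification scores, the point at the box center keeps its full score while the point near the edge is suppressed, which is the behavior the centrality-based branch is described as encouraging. No anchor shapes or anchor hyperparameters appear anywhere in the decoding step.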