| Visual object tracking, one of the important problems in computer vision, aims to estimate and predict the state of a target in the subsequent frames of a video once its initial state, such as position and size, has been defined. In recent years, many excellent tracking algorithms have been proposed thanks to advances in deep learning techniques and computer hardware. In real scenes, however, trackers must still cope with complex and variable factors such as occlusion, interference from similar objects, and target scale change. Designing an algorithm with high tracking accuracy and good real-time performance therefore remains challenging. This dissertation focuses on single object tracking in open scenes and addresses the degradation of tracking accuracy caused by similar background interference, target scale change, and the loss of semantic information in linear correlation. Building on Siamese network object tracking, it develops accurate and robust tracking models based on the attention mechanism, focusing on the construction of effective and rich appearance feature representations. The main work of this dissertation is as follows: (1) A robust visual tracking algorithm with a co-attention guided Siamese network (CGS) is proposed to address background interference in the search image, which degrades tracking accuracy. Based on the SiamRPN tracker, a co-attention module is added to learn the interaction between search features and template features. The module enhances the discriminative features of the template and search regions and reduces the interference of similar background semantic features, thereby suppressing false correlation results. In addition, a gate mechanism is introduced in the template branch to improve the feature representation along the channel dimension, which further improves tracking accuracy. (2) A Siamese hierarchical feature fusion Transformer for efficient tracking (SiamHFFT) is proposed to address object scale change during tracking, especially for small objects. Based on the Siamese network tracking architecture, the algorithm extracts hierarchical features with different scales and semantic information through the backbone network and incorporates detailed structural information into the visual representation. A novel feature fusion Transformer is designed to fuse low-level spatial information with high-level semantic information, integrating and optimizing multi-level, multi-scale features and enhancing both semantic information and spatial detail during tracking, which greatly improves tracking accuracy. At the same time, to keep feature extraction from becoming so expensive that it slows tracking, the algorithm uses a lightweight backbone network, which reduces computational complexity and improves tracking speed. (3) A multi-head cross-attention Transformer for object tracking (MCTT) is proposed to address the linear correlation matching used by traditional tracking models, which can lose semantic information or fall into local optima. The multi-head cross-attention mechanism learns correlations from different subspaces and adaptively focuses on key features, so as to mine the global information interaction between the template and the search region at a deeper level. A simple auxiliary mask prediction head is also designed, which couples the existing backbone network features with the Transformer encoder-decoder features to obtain high-resolution pixel features and generate masks for more accurate and efficient tracking results. |
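The multi-head cross-attention idea in (3) can be illustrated with a minimal sketch. This is not the dissertation's MCTT implementation: the function names, the token layout (lists of per-location feature vectors), and the use of identity projections in place of learned weight matrices are all assumptions made for illustration. Search-region tokens act as queries and template tokens act as keys and values, with the channels split across heads so that each head attends in its own subspace.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def multi_head_cross_attention(search_tokens, template_tokens, num_heads):
    """Split channels into heads, attend per head, concatenate the results.

    Assumption: identity projections stand in for the learned query, key,
    value, and output matrices of a real Transformer layer.
    """
    channels = len(search_tokens[0])
    if channels % num_heads != 0:
        raise ValueError("channel count must be divisible by num_heads")
    head_dim = channels // num_heads
    fused = [[] for _ in search_tokens]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = [tok[lo:hi] for tok in search_tokens]    # queries: search region
        kv = [tok[lo:hi] for tok in template_tokens]  # keys/values: template
        for row, out in zip(fused, cross_attention(q, kv, kv)):
            row.extend(out)
    return fused


# Hypothetical toy features: 2 template tokens and 3 search tokens, 4 channels.
template = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
search = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
fused = multi_head_cross_attention(search, template, num_heads=2)
```

Each output row is a template-weighted summary for one search location. Unlike a single linear correlation map, the weights are computed separately in each head, which is the property the abstract attributes to learning correlations from different subspaces.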