| Single visual object tracking is an important research direction in computer vision. As artificial intelligence is integrated into more industries, it has become part of everyday life. Single visual object tracking algorithms based on residual networks have been widely used in important fields such as medical imaging, human-computer interaction, autonomous driving, and traffic flow monitoring, and their good performance has attracted considerable attention. To further improve the performance of single object tracking, this paper combines the residual network with the Transformer model. The main work and research contents of this paper are as follows: (1) To address the problem that existing single object tracking algorithms are not fast enough to meet real-time requirements, this paper proposes a video single object tracking algorithm that combines the Res2Net residual network with the Transformer model. First, to improve the multi-scale representation ability of the residual network at a finer-grained level, the algorithm introduces the Res2Net residual network as the feature extraction network in the backbone, so as to obtain finer features while enlarging the receptive field of each network layer. Then, the attention mechanism is used to obtain global semantic information, and the long-range dependency modeling of the Transformer is used to make better use of this global information when fusing deep convolutional features. In addition, the parallel computation of the multi-head attention mechanism in the Transformer makes better use of the GPU and speeds up training. Finally, the frame prediction head module applies the dot-product attention mechanism and the depth-wise cross-correlation operation for feature enhancement, which improves the accuracy of the algorithm. Performance comparison experiments were conducted against various
advanced algorithms on the public LaSOT dataset. The experimental results show that the Transformer model using the Res2Net residual network for feature extraction reaches 47 FPS. These results provide a reference basis for research on video single object tracking algorithms. (2) To address the problem that the single object tracking algorithm based on the Res2Net-Transformer model cannot accurately estimate the object state in complex motion scenes, this paper proposes an improved algorithm that aims to raise accuracy while reducing the number of parameters and the amount of computation. The algorithm uses the Res2NeXt residual network, formed by combining the Res2Net residual network with the group convolution structure, as the feature extraction network. The Res2NeXt residual network uses group convolution to improve model capability without increasing the amount of computation, and the larger number of receptive fields available within the network improves the multi-scale representation ability of the model. The algorithm then uses the Transformer model to fuse the deep convolutional features with global information, further improving accuracy. Comparative experiments on the LaSOT and UAV123 datasets show that the proposed algorithm effectively improves the success rate and accuracy of object tracking. |
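
The claim that group convolution adds model capability without adding computation can be checked with simple arithmetic. The following is a minimal sketch, not the configuration used in this work: the helper `conv2d_cost`, the channel counts, the group count, and the feature-map size are all illustrative assumptions. It counts the weights and multiply-accumulate operations of a single 2D convolution, then compares a dense layer, a grouped layer of the same width, and a wider grouped layer.

```python
def conv2d_cost(c_in, c_out, kernel, out_h, out_w, groups=1):
    """Return (weights, multiply-accumulates) of one 2D convolution, bias ignored.

    Hypothetical helper for illustration only. With `groups` groups, each output
    channel is connected to c_in / groups input channels instead of all c_in.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    params = (c_in // groups) * c_out * kernel * kernel
    macs = params * out_h * out_w  # one MAC per weight per output position
    return params, macs


# Assumed toy sizes: 3x3 kernel on a 32x32 output feature map.
dense = conv2d_cost(64, 64, 3, 32, 32, groups=1)       # ordinary convolution
grouped = conv2d_cost(64, 64, 3, 32, 32, groups=8)     # same width, 8 groups
wide = conv2d_cost(176, 176, 3, 32, 32, groups=8)      # wider layer, 8 groups

print("dense  :", dense)    # (36864, 37748736)
print("grouped:", grouped)  # (4608, 4718592)
print("wide   :", wide)     # (34848, 35684352)
```

Under these assumed sizes, splitting the layer into 8 groups cuts both weights and operations by a factor of 8, and a grouped layer with 176 channels still costs slightly less than the dense layer with 64 channels. This is the trade that ResNeXt-style designs exploit: the saved budget can be spent on more channels or more parallel paths.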