Given the initial state of a target, object tracking estimates the state of that target at every subsequent moment of a video sequence. It has wide application value in video surveillance, autonomous driving, military systems, human-computer interaction, and other fields. Because tracking faces many challenges, such as occlusion, deformation, rotation, and scale change, it has attracted the attention of many scholars and remains a hot topic in computer vision. In recent years, research on visual tracking based on deep learning has made notable progress. Although these methods achieve good accuracy, real-time performance, and robustness, several problems remain: temporal information is often neglected, attention mechanisms and Transformer structures are not fully exploited, local and global features are rarely combined in the network model, and online tracking lacks robustness. For this reason, within the Siamese network framework, this dissertation studies object tracking theory and algorithms in the spatiotemporal dimension using convolutional neural networks and attention mechanisms, and obtains the following innovative results.

First, a Siamese network visual tracking method based on spatiotemporal features and attention is proposed. The algorithm takes the Siamese region proposal network as its baseline structure and upgrades the original two-dimensional convolutional backbone to a three-dimensional one. This adds a temporal dimension, extracts motion information from consecutive video frames, and addresses the problem that temporal features are easily ignored. To highlight important spatiotemporal information, a cascade attention module enhances salient spatial, channel, and temporal features, respectively, suppressing the background and emphasizing the foreground. In this way the attention mechanism strengthens the tracking model.

Second, a Transformer spatiotemporal fusion visual tracking method based on dual attention is proposed. Unlike previous trackers that rely only on convolutional neural networks, this algorithm uses three-dimensional convolution in the backbone to extract spatiotemporal information. It then introduces a Transformer into the Siamese framework and constructs a dual-attention spatiotemporal fusion Transformer module, which establishes global long-range nonlinear relationships and captures more information relating the target to its background. To adapt the Transformer to the spatiotemporal nature of tracking, its structure is modified: a temporal attention module connects feature maps across multiple frames to obtain temporal motion information, while a spatial attention module computes the correlation between template and search features to distinguish foreground from background. Finally, dual-attention spatiotemporal fusion Transformer modules at three scales are stacked and their spatiotemporal information is combined through a fusion layer, so that local and global feature information is fused effectively. The algorithm also dynamically updates part of the templates, incorporating appearance changes while still attending to the first-frame template, which addresses the poor robustness caused by using only the initial frame or only dynamic frames as templates.
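To make the dual-attention idea in the second method more concrete, the following is a minimal sketch rather than the dissertation's actual module: it assumes tokenized features and PyTorch's standard nn.MultiheadAttention, and the class name DualAttentionFusion together with all dimensions is chosen purely for illustration. Temporal self-attention lets search tokens exchange information across frames, and spatial cross-attention correlates them with template tokens to separate foreground from background.

```python
import torch
import torch.nn as nn


class DualAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
        # template: (B, Nz, C) tokens from the template frame
        # search:   (B, T, Nx, C) tokens from T consecutive search frames
        b, t, nx, c = search.shape
        # Temporal self-attention: all T * Nx search tokens attend to one another,
        # so information is exchanged across frames as well as across positions.
        seq = search.reshape(b, t * nx, c)
        seq = self.norm1(seq + self.temporal_attn(seq, seq, seq, need_weights=False)[0])
        # Spatial cross-attention: search tokens query the template tokens,
        # strengthening regions that correlate with the target appearance.
        fused = self.norm2(
            seq + self.spatial_attn(seq, template, template, need_weights=False)[0]
        )
        return fused  # (B, T * Nx, C); a prediction head would consume this


if __name__ == "__main__":
    z = torch.randn(2, 64, 256)      # template tokens (e.g., an 8 x 8 grid)
    x = torch.randn(2, 3, 400, 256)  # 3 search frames, 20 x 20 tokens each
    print(DualAttentionFusion()(z, x).shape)  # torch.Size([2, 1200, 256])
```

In a complete tracker the fused tokens would feed a classification and box-regression head; the residual-plus-LayerNorm pattern simply follows common Transformer practice and is not specific to the proposed method.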
Third, a Transformer-based visual tracking method for spatiotemporal key regions is proposed. Global computation over Transformer input sequences is inefficient and can focus neither on the important sequences nor on the key target regions. The algorithm therefore first constructs a Transformer-based key-region extraction module that refines the template and search feature sequences, selects a small number of sequences with high response values, and distinguishes the target from the background through cross-correlation calculations. Then, to reduce the number of parameters, the algorithm uses a Transformer directly as the backbone network to extract features from multiple video frames and superimposes the feature maps to fuse spatiotemporal information. Finally, to improve the accuracy of bounding-box estimation, a center-corner prediction method is designed to reduce the interference of irrelevant edge information.

Fourth, a deformable Transformer and spatiotemporal visual tracking method is proposed. The original Transformer suffers from redundant structural features, susceptibility to extraneous information outside the region of interest, neglect of the fusion of local and global features, and a lack of spatial and temporal information extraction. First, the algorithm introduces a deformable attention module, which selects sequence positions in a data-dependent manner to obtain more effective features and reduce redundancy (a simplified sketch of this sampling idea is given after this summary). Then, a Siamese network tracker with two branches, a template branch and a search branch, is constructed. The template branch extracts features through a two-dimensional convolutional neural network and establishes nonlinear global relationships through a self-attention module. The search branch extracts spatiotemporal features through a three-dimensional convolutional neural network and highlights important spatiotemporal information through a spatiotemporal fusion module. Finally, the template and search branch features are correlated through a cross-attention module to establish the relationship between the target and the background. Because the algorithm combines convolutional neural networks with a Transformer, the network model captures local, global, and spatiotemporal information while also updating the Transformer structure.

In addition, this dissertation evaluates and analyzes the proposed methods on well-known public datasets and compares them quantitatively and qualitatively with current state-of-the-art algorithms. The experimental results show that the proposed methods are effective: they alleviate the critical problems identified above and improve the overall performance of visual object tracking while achieving real-time speed.
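As referenced in the fourth method, the following is a simplified, assumed sketch of data-dependent sampling in the spirit of deformable attention, not the dissertation's implementation. Each query predicts a few sampling offsets around a reference point, gathers features there by bilinear interpolation with torch.nn.functional.grid_sample, and mixes them with predicted weights instead of attending to every position, which is how redundancy is reduced. The class name DeformableSamplingAttention, the offset scale of 0.1, and the number of sampling points are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableSamplingAttention(nn.Module):
    def __init__(self, dim: int = 256, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        self.offset_head = nn.Linear(dim, 2 * n_points)  # (dx, dy) per sampling point
        self.weight_head = nn.Linear(dim, n_points)      # one mixing weight per point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries: torch.Tensor, ref_points: torch.Tensor,
                feat_map: torch.Tensor) -> torch.Tensor:
        # queries:    (B, N, C)    query embeddings
        # ref_points: (B, N, 2)    reference locations in [-1, 1] x [-1, 1]
        # feat_map:   (B, C, H, W) value feature map
        b, n, c = queries.shape
        offsets = self.offset_head(queries).view(b, n, self.n_points, 2)
        weights = self.weight_head(queries).softmax(dim=-1)            # (B, N, P)
        # Small data-dependent offsets around each reference point.
        grid = ref_points.unsqueeze(2) + 0.1 * offsets.tanh()          # (B, N, P, 2)
        values = self.value_proj(feat_map.flatten(2).transpose(1, 2))  # (B, HW, C)
        values = values.transpose(1, 2).reshape(b, c, *feat_map.shape[2:])
        # Bilinear sampling at the predicted locations only.
        sampled = F.grid_sample(values, grid, align_corners=False)     # (B, C, N, P)
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1)             # (B, C, N)
        return self.out_proj(out.transpose(1, 2))                      # (B, N, C)


if __name__ == "__main__":
    q = torch.randn(2, 100, 256)
    ref = torch.rand(2, 100, 2) * 2 - 1
    fmap = torch.randn(2, 256, 20, 20)
    print(DeformableSamplingAttention()(q, ref, fmap).shape)  # torch.Size([2, 100, 256])
```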