| As an important research topic in computer vision, single object tracking is widely used in weapon guidance, intelligent transportation, video surveillance, human-computer interaction, and other fields. It is also the basis of cutting-edge research such as multi-object tracking, video detection, and behavior recognition, and has therefore attracted attention from both industry and academia. In practical applications in complex scenes, illumination changes, motion blur, low resolution, occlusion, target deformation, motion distortion, and other factors degrade the key technical modules of single object tracking, so many problems in complex scenes still urgently need to be solved. This paper focuses on three key problems in order to improve the key technologies of single object tracking in complex scenes. (1) Low visual quality samples, such as blurred and low-resolution images in complex scenes, directly and significantly complicate the mathematical description of target appearance and reduce the robustness of the feature extraction model. (2) In complex scenes, the target’s moving speed changes sharply, or occlusion makes the target disappear from the field of view, so it is difficult to locate the target accurately through appearance alone. (3) Similar objects and target deformation in complex scenes impair target discrimination and identity recognition. To solve these problems, researchers in recent years have studied and improved the key technologies of single object tracking systems, such as the feature extraction model, motion model, and observation model. However, existing single object tracking algorithms still have the following limitations: (1) For low visual quality samples in complex scenes, current tracking algorithms mainly integrate traditional hand-crafted features and image reconstruction networks with nonlinear mapping. These enrich the texture and contour details of image features but cause nonlinear changes in
the original image’s spatial structure and increase the center error of the tracking system. (2) For target motion distortion problems such as fast motion, rotation, and unreliable motion states, current tracking algorithms adopt motion models based on particle filters, dense sampling, Markov chains, or Markov decision models. These methods are memoryless and lack the target’s continuous historical motion state. As a result, the predicted search area is inaccurate, and the target is lost. (3) For target appearance changes, such as target deformation, occlusion, and background clutter, existing methods adopt either a Siamese network observation model based on the correlation operator or a deep learning classification and regression observation model based on the convolution operator. These methods ignore the semantic information of the target image, causing the model to fall into a local optimum and making the estimates of the target center and shape inaccurate. To solve the above problems and improve on existing methods, this paper studies in depth the key technologies of single object tracking systems in complex scenes, including the feature extraction model, motion model, and observation model. We try to refine and supplement the theory of the current motion model of single object tracking and to explore new methods or frameworks for the feature extraction and observation models, in order to provide distinctive solutions to the problem of single object tracking in video target perception in complex scenes. The main research work and innovations of this paper are as follows: (1) Aiming at the problem of low visual quality samples and the limitations of current solutions, this paper focuses on the feature extraction network and the feature organization framework. It proposes a neighborhood-related feature based on a channel attention mechanism and a feature association tracking framework of hierarchical image quality, and constructs
neighborhood topological association based on existing deep learning features. The proposed method effectively enhances the mathematical description of the appearance of blurred and low-resolution targets. Firstly, for the feature extraction task on low visual quality images, a neighborhood-related feature extraction network based on a channel attention mechanism is proposed to enhance the feature extraction model’s description of blurred and low-resolution samples. The network introduces the channel attention mechanism to remove blurred image information from low visual quality images and retain the feature channels that effectively describe the target’s appearance. On this basis, for the target shape estimation task, the network introduces a 2D RNN layer to construct adjacency relationships between image blocks. Through the spatial association between adjacent sub-blocks of the target image, the dependence of the feature extraction model on any single low visual quality pixel is reduced. With the proposed neighborhood-related features based on the channel attention mechanism, this paper explores a feature extraction model for the neighborhood topological relationship of pixels, which is an important supplement to current local feature extraction models. Secondly, aiming at the problem of low-resolution samples in low visual quality scenes and the limitations of current solutions, this paper proposes a feature association method of hierarchical image quality to reduce the target center position deviation caused by the nonlinear mapping of the deep learning network. Combined with the image reconstruction network, we enhance the gradient information of image features and reduce the nonlinear spatial distortion of the deep neural network by stratifying the image features into a high-resolution layer (HR layer) and a low-resolution layer (LR layer). Using the spatial structure integrity of the LR layer, this paper imposes local spatial constraints on
the feature ridge regression localization equation of the HR layer and uses the rich texture and contour information of the HR layer to improve the accuracy of target shape estimation. The center error is thereby reduced, improving the tracking performance of the single object tracking algorithm in complex scenes. With the proposed feature association framework of hierarchical image quality, this paper attempts to find a new framework for video target analysis that combines image reconstruction, object tracking, and other directions. The proposed methods are tested on the OTB-2013, OTB-2015, and UAV123 datasets. The results show that the proposed neighborhood-related feature extraction network based on the channel attention mechanism effectively improves the tracking distance accuracy (7.9%) and tracking success rate (5.6%) for low visual quality targets; the proposed feature association method of hierarchical image quality improves the tracking accuracy and tracking success rate by more than 2% over the state-of-the-art tracking algorithm. (2) Aiming at the problems of fast motion and of moving targets disappearing from the field of view in motion distortion scenes, and the limitations of current solutions, this paper proposes a Markov chain based on motion state confidence and a Markov decision model based on motion state and visual features. Firstly, aiming at the problem of fast motion, a confidence-based Markov motion model is proposed to fit the physical model of target motion, predict the motion direction and speed of the target, and estimate the target’s search area. The historical motion states of the target are modeled as the nodes of a Markov chain. The confidence of each node is evaluated from the candidate’s appearance, and the motion probability of each node is weighted by its confidence to enhance the robustness of the motion model. Secondly, aiming at the problem that the target moves out of view, this paper quantifies the historical motion state of the
target and builds a Markov decision model combined with policy gradient. We use the reinforcement learning reward to strengthen the propagation of gradient parameters in target motion direction prediction, improving the accuracy of target motion state (direction and speed) prediction. This paper thereby attempts to supplement and improve current theory and models of video target motion perception. The proposed methods are tested on general single object tracking datasets such as DTB70, UAVDT, OTB-2015, and GOT-10k. The results show that the proposed method effectively reduces the center prediction error of the search area (within 5 pixels per frame) and improves the prediction accuracy of the search area (up to 65.4%) and the tracking success rate (up to 66.3%). (3) Aiming at the problems of target deformation and background clutter and the limitations of current solutions, this paper proposes an observation model that combines multi-factor information, including a similarity measurement algorithm based on multi-information fusion and a Siamese network observation model combined with a self-attention module. Firstly, aiming at the problem of target appearance deformation, a confidence evaluation module for candidate samples is proposed. It combines spatial information, category information, and appearance information to evaluate the local correlation between candidate samples and the target and thereby estimate the candidate samples’ confidence. Secondly, aiming at background clutter, the paper proposes a Siamese network combined with autocorrelation and cross-correlation modules to enhance the ability of the network’s output semantic features and local texture features to distinguish similar distractors and improve the success rate of object tracking. Through the organic combination of autocorrelation and cross-correlation, the proposed method explores a new observation model for current correlation-based single object tracking algorithms. The proposed method is tested on general datasets
for single object tracking such as LaSOT, VOT LT35, and TrackingNet. The results show that the proposed similarity measurement algorithm based on multi-information fusion effectively improves template robustness for deformed targets, improving tracking accuracy by 4.1% and tracking success rate by 2.3% on the tracking framework with TransT as the baseline algorithm. The proposed Siamese network combined with autocorrelation and cross-correlation modules effectively improves the tracking accuracy (63.8% on the LaSOT dataset) and tracking success rate (68.4%) of state-of-the-art tracking algorithms for deformed targets, and the overall tracking accuracy and success rate on the OTB-2015 dataset reach 93.6% and 71.2%, respectively. |
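
The confidence-weighted motion model summarized in contribution (2) can be illustrated with a minimal sketch. It is not the model proposed in this paper: it assumes a simple constant-velocity approximation in which each past displacement is weighted by an appearance confidence score, and the function name `predict_search_center` and its inputs are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def predict_search_center(centers: List[Point], confidences: List[float]) -> Point:
    """Predict the next search-area center from past target centers.

    Illustrative assumption: each past displacement (velocity) is weighted
    by the appearance confidence of the frame it leads to, so unreliable
    frames (e.g. blurred or occluded) contribute less to the prediction.
    """
    if not centers:
        raise ValueError("at least one past center is required")
    if len(centers) != len(confidences):
        raise ValueError("centers and confidences must have the same length")
    if len(centers) < 2:
        # No motion history yet: search around the last known position.
        return centers[-1]

    weighted_vx = 0.0
    weighted_vy = 0.0
    total_weight = 0.0
    for i in range(1, len(centers)):
        vx = centers[i][0] - centers[i - 1][0]
        vy = centers[i][1] - centers[i - 1][1]
        w = confidences[i]
        weighted_vx += w * vx
        weighted_vy += w * vy
        total_weight += w

    if total_weight == 0.0:
        # Every step is untrusted: fall back to the last known position.
        return centers[-1]

    last_x, last_y = centers[-1]
    return (last_x + weighted_vx / total_weight,
            last_y + weighted_vy / total_weight)
```

In this sketch, a frame with confidence 0 (for example, a suspected tracking failure) does not influence the estimated velocity, which is the intuition behind weighting each node's motion probability by its confidence.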