Video single-object tracking is one of the fundamental tasks in computer vision and has a wide range of applications in intelligent industries such as robot vision, video surveillance, and sports video analysis. Existing single-object trackers depend heavily on visual information about the target: they use its initial appearance features to localize it in subsequent frames. This reliance creates performance bottlenecks and limits the scope of application of single-object tracking. On the one hand, the appearance features of a moving target contain rich texture information but only sparse target semantics, which limits the tracker's ability to discriminate the target; on the other hand, the appearance of the target keeps changing during long-term motion and drifts away from the initial appearance, so the tracker cannot stably follow the target over long periods. As a complement to the visual modality, a natural language description of the target introduces rich semantic information that can improve both the target discrimination and the long-term tracking capability of a single-object tracker, and thus motivates the development of vision-language dual-modal tracking algorithms.

This paper explores the design of vision-language dual-modal single-object tracking algorithms, investigates how linguistic information can be exploited in the tracking task, and proposes a dual-modal tracking algorithm based on local alignment modeling and a long-term tracking algorithm based on dual-modal semantics and motion information. The contributions of this paper are as follows.

(1) Dual-modal single-object tracking algorithm based on local alignment modeling. To fully fuse the visual and linguistic information of the target, the visual features are decomposed into multiple local visual semantics, multiple local linguistic features are extracted from the natural language description of the target, and local features with the same semantics are fused by modeling the correspondence between the local features of the two modalities, which reduces the semantic gap between the modalities and yields a more powerful dual-modal representation. To this end, this paper designs a new dual-modal tracking framework and proposes a foreground-aware memory module, a part-aware cross-attention module, and a vision-language local contrast module. The foreground-aware memory module and the part-aware cross-attention module perform the local decomposition of visual semantics, while the vision-language local contrast module learns the correspondence between the local features of the two modalities and fuses the corresponding local features.
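To make the local-alignment idea concrete, the following PyTorch-style sketch shows one plausible way to decompose visual features into part-level tokens with cross-attention and align them with word-level language features through a local contrastive loss. The module names, tensor shapes, and hyperparameters are illustrative assumptions, not the implementation described in this thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartAwareCrossAttention(nn.Module):
    """Illustrative sketch: decompose the target into K local part tokens and
    collect the matching local visual and linguistic semantics for each part."""
    def __init__(self, dim=256, num_parts=8, num_heads=8):
        super().__init__()
        # Learnable part queries play the role of local semantics.
        self.part_queries = nn.Parameter(torch.randn(num_parts, dim))
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.language_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens, word_tokens):
        # visual_tokens: (B, N, dim) patch features of the target region
        # word_tokens:   (B, L, dim) token-level features of the description
        B = visual_tokens.size(0)
        q = self.part_queries.unsqueeze(0).expand(B, -1, -1)          # (B, K, dim)
        # Each part query attends to the image to gather a local visual semantic.
        visual_parts, _ = self.visual_attn(q, visual_tokens, visual_tokens)
        # The same queries attend to the sentence to gather the matching phrase.
        language_parts, _ = self.language_attn(q, word_tokens, word_tokens)
        return visual_parts, language_parts                            # (B, K, dim)

def local_contrastive_loss(visual_parts, language_parts, temperature=0.07):
    """Vision-language local contrast: pull the k-th visual part toward the
    k-th language part and push it away from the other parts (InfoNCE style)."""
    v = F.normalize(visual_parts, dim=-1)                              # (B, K, D)
    t = F.normalize(language_parts, dim=-1)                            # (B, K, D)
    logits = torch.einsum('bkd,bjd->bkj', v, t) / temperature          # (B, K, K)
    labels = torch.arange(v.size(1), device=v.device).expand(v.size(0), -1)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())
```

The corresponding local features of the two modalities can then be fused (for example, by addition or concatenation) to form the dual-modal representation used for tracking.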
(2) Long-term single-object tracking algorithm based on dual-modal semantics and motion information. To enable a single-object tracker to follow a moving target stably over long periods, it is necessary both to update the appearance information of the target and to relocate the target after the tracker loses it or the target reappears. To this end, this paper proposes a new global relocation module that uses motion information to detect target disappearance and mislocalization and adaptively retrieves the target over the whole image. To improve the accuracy of retrieval, the module uses vision-language dual-modal information to discriminate among candidate targets within the global image range. The re-located target patch is then used to update the appearance information of the target so that the tracker can adapt to appearance changes.
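As a rough illustration of how such a relocation step could work, the sketch below triggers a global search when the tracker's confidence is low or a motion model flags the reported location as implausible, and then scores candidate boxes with a combination of visual and linguistic similarity. The disappearance test, the fusion weight, and all names here are assumptions for illustration, not the exact formulation in this thesis.

```python
import torch
import torch.nn.functional as F

def relocate_target(tracker_score, motion_gate, candidate_feats, target_feat,
                    text_feat, score_thresh=0.3, alpha=0.5):
    """Illustrative global relocation step.

    tracker_score:   scalar confidence of the local tracker in the current frame
    motion_gate:     True if the predicted motion (e.g. a Kalman prediction)
                     disagrees with the tracker's output box
    candidate_feats: (M, D) features of candidate boxes from a global search
    target_feat:     (D,) current visual template feature
    text_feat:       (D,) sentence-level language feature
    """
    # Fall back to global re-detection only when the tracker is unreliable or
    # the motion model says the reported location is implausible.
    if tracker_score >= score_thresh and not motion_gate:
        return None  # keep local tracking, no relocation needed

    cand = F.normalize(candidate_feats, dim=-1)
    vis_sim = cand @ F.normalize(target_feat, dim=-1)   # visual similarity (M,)
    txt_sim = cand @ F.normalize(text_feat, dim=-1)     # language similarity (M,)
    scores = alpha * vis_sim + (1 - alpha) * txt_sim    # dual-modal candidate score
    best = int(torch.argmax(scores))
    # The re-located candidate is later used to refresh the appearance template.
    return best, scores[best]
```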