Research On Methods For Focal Target Localization And Tracking Based On Cross-modal Features

Posted on: 2024-12-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W J Zhang
Full Text: PDF
GTID: 1528307370971029
Subject: Security engineering
Abstract/Summary:
The widespread deployment and interconnection of video surveillance systems have made large-scale surveillance video data a crucial asset for law enforcement agencies in the detection, prevention, and deterrence of criminal activity. Applying intelligent video analysis to this vast volume of video data significantly enhances the effectiveness of law enforcement operations. Target localization and tracking, a pivotal technology within intelligent video analysis, enables the accurate localization and consistent tracking of targets in surveillance videos, generating movement trajectories. This assists law enforcement personnel in identifying, verifying, and tracking suspects amid massive amounts of surveillance footage, thereby improving the efficiency of case resolution. Consequently, object tracking technology is of great significance in both research and real-world applications.

Given the rapid advances in computer technology and deep learning, methods for target localization and tracking in general scenarios have shown remarkable performance gains across diverse datasets. In practical applications, however, they face challenges such as acquiring known target images, handling target occlusion, coping with appearance distortion, managing disappearance and reappearance, and addressing significant variations in the apparent features of the same target across different cameras. Algorithms that rely solely on visual information for appearance modeling struggle to achieve precise target localization, stable tracking, and accurate cross-camera matching. To address these issues, this work studies methods for target localization and tracking based on cross-modal features, with the aim of enhancing the precision and stability of target localization and tracking in surveillance videos. The main research contents and innovations are as follows:

(1) A referring image segmentation approach leveraging cross-modal attention guidance and visual reasoning is proposed to address inaccurate target localization and inadequate multi-modal feature discrimination. First, a dual-branch multi-scale target perception module is introduced in the feature enhancement phase to capture both local and global multi-scale detail information of the target, enhancing the model's ability to perceive objects of varying sizes. Next, a feature fusion module guided by cross-modal attention is constructed in the feature fusion stage; it integrates language features with local and global visual features, yielding representations that capture both local and global multi-modal cues. Finally, a language-guided visual relationship graph is established in the visual reasoning stage; it exploits the relational context in language descriptions, coupled with graph convolution operations, to perform visual reasoning and enhance the discrimination of multi-modal features. Experiments on the UNC, UNC+, G-Ref, and ReferIt datasets demonstrate the effectiveness of the proposed method: it achieves Overall IoU scores of 67.87%, 58.49%, 53.47%, and 68.12%, respectively, improvements of 2.18%, 3.25%, 2.54%, and 1.32% over state-of-the-art methods.
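To make the visual reasoning step in (1) concrete, the following is a minimal sketch of language-guided graph reasoning over region features. It assumes pre-extracted visual region features and a sentence-level language embedding; the module name, dimensions, and single-layer graph convolution are illustrative assumptions, not the dissertation's actual implementation.

```python
# Minimal sketch: language-guided graph reasoning over region features.
# All names and dimensions here are hypothetical, for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedGraphReasoning(nn.Module):
    def __init__(self, vis_dim: int, lang_dim: int, hidden_dim: int):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)      # project regions
        self.lang_proj = nn.Linear(lang_dim, hidden_dim)    # project sentence
        self.gcn_weight = nn.Linear(hidden_dim, hidden_dim) # one GCN layer

    def forward(self, regions: torch.Tensor, sentence: torch.Tensor):
        # regions: (B, N, vis_dim) region features; sentence: (B, lang_dim)
        v = self.vis_proj(regions)                          # (B, N, H)
        l = self.lang_proj(sentence).unsqueeze(1)           # (B, 1, H)
        # Language-conditioned node features: modulate regions by language.
        nodes = v * l                                       # (B, N, H)
        # Edge weights: scaled pairwise affinity between conditioned nodes,
        # row-normalized so each node aggregates a convex combination.
        adj = torch.softmax(
            nodes @ nodes.transpose(1, 2) / nodes.size(-1) ** 0.5, dim=-1)
        # One graph-convolution step propagates relational context.
        out = F.relu(self.gcn_weight(adj @ nodes))          # (B, N, H)
        return out + v  # residual keeps the original visual evidence
```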
(2) A referring image segmentation method based on mutual visual-language guidance is proposed to address imprecise target localization caused by limited deep interaction between visual and language features. First, a visual-guided language query encoder is presented that integrates key visual cues into the language information, generating language queries enriched with target-related visual cues. Next, a context encoding module is constructed that performs visual context encoding guided by both intra-modal and cross-modal interactions between the visual and language modalities. Finally, a semantic-guided enhancement module is developed that harnesses the rich semantic information in multi-modal features to guide the fusion of shallow visual features with multi-modal features, sharpening the recognition of target semantic boundaries. Evaluation on the UNC, UNC+, and G-Ref datasets demonstrates the effectiveness of the method: it achieves Overall IoU scores of 75.91%, 68.51%, and 62.08%, respectively, outperforming state-of-the-art methods.

(3) An object tracking approach based on a cross-modal feature encoding fusion strategy is proposed to address inconsistent target tracking caused by the absence of deep visual-language feature fusion and inadequate temporal target perception. First, in the feature encoding phase, information interactions between template and search features are conducted simultaneously with cross-modal exchanges between language and visual features; these interactions are densely optimized across multiple encoding stages, enabling multi-stage, deep fusion of visual and language features and strengthening target perception. Second, a temporal information mining module guided by global language information is constructed; it extracts temporal information from historical target features, improving tracking performance in scenarios such as target deformation and reappearance. Extensive experiments on the LaSOT, OTB99, and TNL2K datasets demonstrate the superiority of the approach: it achieves tracking accuracy rates of 65.7%, 70.7%, and 62.7%, respectively, surpassing various prevalent methods. Performance tests on two custom-built datasets further validate the efficacy of the proposed method.

(4) A cross-camera object re-identification approach is proposed to address the difficulty of accurately matching the same target across different fields of view under significant appearance disparities. First, a fine-grained feature learning module is developed that leverages the reverse denoising process of diffusion models to capture feature representations enriched with fine-grained text-image matching information. Next, a semantic consistency alignment strategy is introduced to ensure that the fine-grained matching information remains relevant to the input image-text pair. Finally, a local-to-global cross-modal matching module is constructed to propagate fine-grained matching information into the visual-language context. Evaluation on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets demonstrates the effectiveness of the proposed method: it achieves Rank-1 scores of 75.67%, 65.26%, and 65.67%, outperforming state-of-the-art methods by 2.29%, 1.80%, and 5.47%, respectively. These results validate the superiority of the approach in addressing cross-camera object tracking challenges.
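As a concrete illustration of the local-to-global matching idea in (4), the sketch below scores an image-text pair by first computing patch-token similarities and then aggregating them into a global matching score. The upstream encoders, tensor shapes, and the max-then-mean aggregation are assumptions made for illustration; the dissertation's module may differ.

```python
# Minimal sketch: local-to-global cross-modal matching for text-based
# person re-identification. Patch/token encoders are assumed upstream;
# the aggregation scheme is an illustrative choice, not the thesis method.
import torch
import torch.nn.functional as F

def local_to_global_match(patch_feats: torch.Tensor,
                          token_feats: torch.Tensor) -> torch.Tensor:
    """patch_feats: (B, P, D) image patch embeddings
       token_feats: (B, T, D) text token embeddings
       returns:     (B,)     image-text matching scores"""
    p = F.normalize(patch_feats, dim=-1)
    t = F.normalize(token_feats, dim=-1)
    # Local matching: cosine similarity between every patch and token.
    sim = torch.einsum("bpd,btd->bpt", p, t)   # (B, P, T)
    # For each text token, keep its best-matching patch (fine-grained cue).
    per_token = sim.max(dim=1).values          # (B, T)
    # Global score: average the token-level matches over the sentence.
    return per_token.mean(dim=-1)              # (B,)
```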
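Similarly, the temporal information mining module in (3) can be sketched as a language-queried cross-attention over historical target features: the global language feature forms the query, and features of the target from past frames form the keys and values. The history buffer, dimensions, and single attention layer below are illustrative assumptions rather than the dissertation's implementation.

```python
# Minimal sketch: temporal information mining guided by a global language
# feature, loosely following the tracking module described in (3).
# The history buffer and all dimensions are assumed for illustration.
import torch
import torch.nn as nn

class TemporalMining(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Language feature queries the historical target features.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, lang_global: torch.Tensor, history: torch.Tensor):
        # lang_global: (B, D) sentence embedding; history: (B, T, D)
        # target features collected from T past frames.
        q = lang_global.unsqueeze(1)                   # (B, 1, D)
        temporal_ctx, _ = self.attn(q, history, history)
        # temporal_ctx summarizes language-relevant temporal evidence and
        # can be fused with current search-region features downstream.
        return temporal_ctx.squeeze(1)                 # (B, D)
```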
Keywords/Search Tags:Video surveillance, Cross-modal features, Attention mechanism, Target localization, Object tracking