
Model Learning, Compression, and Integration for Robust Visual Object Tracking

Posted on: 2022-01-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N Wang
Full Text: PDF
GTID: 1488306323462874
Subject: Information and Communication Engineering
Abstract/Summary:
Visual object tracking is a fundamental task in computer vision. Given the initial target state, a visual tracker must locate the target object in successive frames. In recent years, deep learning based visual tracking has made significant progress. Nevertheless, deep learning techniques also bring issues such as expensive annotation cost, high model complexity, and limited tracking efficiency. To unlock the potential of deep visual tracking, this thesis focuses on three aspects: model learning, model compression, and model integration. The main contributions of this thesis are three-fold.

In model learning, this thesis investigates how to reduce the training cost of deep trackers and how to leverage the temporal information residing in tracking videos. First, to alleviate the expensive data labeling and high training cost of deep visual tracking, this thesis presents an unsupervised tracking framework based on forward and backward tracking trajectory analysis. The motivation for unsupervised learning is that a robust tracker should be effective in bidirectional tracking. In the training process, the proposed algorithm measures the consistency between forward and backward trajectories to learn a robust tracker from scratch using only unlabeled videos. The proposed unsupervised tracker matches the baseline accuracy of classic fully supervised trackers while running at real-time speed. Next, to exploit the rich temporal information in the video flow, this thesis introduces the transformer architecture to the visual tracking community. The transformer encoder-decoder structure tightly bridges otherwise isolated video frames to propagate rich temporal cues (e.g., target features and attention masks) across frames. With the proposed transformer, existing trackers gain substantial performance improvements and achieve state-of-the-art accuracy.

In model compression, to tackle the high computational complexity, large number of model parameters, and unsatisfactory tracking efficiency of deep tracking, this thesis proposes to jointly compress and transfer heavyweight tracking models. This work treats a CNN model pretrained on the image classification task as a teacher network, and distills this teacher network into a lightweight student network that serves as the feature extractor to speed up correlation filter trackers. In the distillation process, this thesis proposes a fidelity loss that enables the student network to maintain the representation capability of the teacher network, and designs a tracking loss that adapts the objective of the student network from object recognition to visual tracking. Extensive experiments on standard datasets demonstrate that the lightweight student network accelerates state-of-the-art deep trackers to real-time speed on a single-core CPU while maintaining almost the same tracking accuracy.

In model integration, this thesis investigates how to assemble multiple deep trackers so that the models complement one another. This thesis proposes two ensemble algorithms: a multi-cue analysis strategy and a policy-based switch framework. The multi-cue analysis framework constructs multiple experts using correlation filters, and each expert tracks the target independently. With the proposed robustness evaluation strategy, a suitable expert is selected for tracking in each frame. Furthermore, the divergence among the experts reveals the reliability of the current tracking result; this divergence is quantified and used to update the experts adaptively, which keeps them from corruption. With the proposed multi-cue analysis, tracking performance is significantly improved. The policy-based switch framework aims to retain the performance advantage of the ensemble framework without sacrificing tracking efficiency. This algorithm consists of multiple weak but complementary experts and an agent network. By formulating the expert switch across consecutive frames as a decision-making problem, this approach learns an agent via reinforcement learning that directly decides which expert should handle the current frame without running the others, which preserves the overall tracking efficiency. Extensive experiments verify the effectiveness of the proposed method.
Keywords/Search Tags: Visual object tracking, Unsupervised learning, Model compression, Ensemble learning, Correlation filter, Siamese network
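
The forward-backward consistency idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's training loss, which the abstract does not specify; it only shows the underlying measurement: track forward through a clip, track back to the first frame, and compare the end point with the starting position. The names `forward_backward_error`, `step`, `exact_step`, and `drifting_step` are hypothetical, and each "frame" is a toy stand-in that stores only the scene offset rather than an image.

```python
import numpy as np


def forward_backward_error(frames, start_pos, step):
    """Track forward through `frames`, then backward to the first frame,
    and return the distance between the start and the final position.

    `step(src, dst, pos)` is a hypothetical single-step tracker that
    returns the estimated target position in `dst` given its position
    `pos` in `src`.
    """
    start = np.asarray(start_pos, dtype=float)
    pos = start.copy()
    # Forward pass: frame 0 -> 1 -> ... -> T
    for src, dst in zip(frames[:-1], frames[1:]):
        pos = step(src, dst, pos)
    # Backward pass: frame T -> T-1 -> ... -> 0
    reversed_frames = frames[::-1]
    for src, dst in zip(reversed_frames[:-1], reversed_frames[1:]):
        pos = step(src, dst, pos)
    return float(np.linalg.norm(pos - start))


# Toy data (assumption): each frame is represented by the scene offset only.
frames = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([2.0, 1.0])]


def exact_step(src, dst, pos):
    # Follows the true motion between the two frames.
    return pos + (dst - src)


def drifting_step(src, dst, pos):
    # Same as above, but accumulates a constant bias on every step.
    return pos + (dst - src) + np.array([0.5, 0.0])


print(forward_backward_error(frames, (5.0, 5.0), exact_step))
print(forward_backward_error(frames, (5.0, 5.0), drifting_step))
```

In this toy setting the exact tracker returns to its starting point, while the drifting tracker ends away from it. A consistency error of this kind needs no bounding-box labels, which is why it can serve as a training signal on unlabeled videos.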