
Research On Open-Set Object Detection Method Based On Multi-Modal Learning

Posted on: 2024-11-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Y Ma
Full Text: PDF
GTID: 1528307373971029
Subject: Computer Science and Technology
Abstract/Summary:
As a critical research direction in computer vision, object detection has a wide range of applications in intelligent transportation, video surveillance, autonomous driving, and other areas. Existing methods typically address object detection in only a single setting, such as cross-domain detection or new-class detection, so the resulting models cannot handle open-world object detection. With the rise of multi-modal learning, artificial intelligence models are beginning to incorporate multi-modal data to meet the challenges of open-world settings, greatly enhancing their understanding of the real world.

This dissertation applies multi-modal learning to object detection, using the rich category information inherent in image and text data for background modeling or dictionary learning to help downstream object detection models improve their performance. Specifically, background modeling can effectively alleviate background shift in cross-domain detection, while dictionary learning can define unlabeled categories in new-class detection. Experiments are conducted in two scenarios, multi-object recognition in road scenes and monitoring of marine biodiversity, and confirm the role of multi-modal learning in alleviating the challenges of cross-domain and new-class detection. The main research contributions of this dissertation are as follows:

(1) This dissertation proposes a domain-generalization object detection method based on multi-modal representation alignment, aiming to enhance the model's ability to detect objects in completely unknown new domains. It presents an object detection method based on text category embeddings, leveraging the uniqueness of these embeddings to construct a visual feature classifier guided by text semantics. A pre-trained multi-modal representation alignment model is introduced to help model the background fully for downstream tasks. In addition, representation consistency learning and domain adversarial learning modules improve the model's learning ability on the source domain, increasing object detection accuracy in completely unknown domains.

(2) This dissertation proposes a foundation-model-based domain adaptation method for open environments, aiming to enhance the model's detection capability in unlabeled new domains. A vision-language pre-trained foundation model is introduced to increase category granularity, improving the model's ability to adapt to complex data distributions. A hierarchical feature alignment strategy maps the features of the source domain and the target domain into the same semantic space. In multi-source-domain and multi-target-domain settings, the method addresses the corresponding challenges through cross-reconstruction and the freezing of certain parameters, effectively alleviating knowledge forgetting while also improving cross-domain object detection accuracy.

(3) This dissertation proposes an open-world object detection method that is more closely aligned with practical applications and considers cross-domain and new-class detection scenarios together. Existing single-stage frameworks are unable to address this challenging task. The proposed method constructs a two-stage training framework: pre-training builds an instance dictionary and establishes connections between annotated and unannotated classes, enabling the detection of new-class objects in downstream tasks. During formal training, domain adversarial training is used to further narrow the domain gap, reduce missed detections under scene changes, and enhance the model's accuracy in detecting new-class objects in cross-domain scenarios.

(4) This dissertation presents a novel approach to new-class object detection based on visual-text joint pre-training, aiming to detect unannotated new categories in oceanic datasets. It also proposes the Marine Det dataset for oceanic scene object detection, comprising over 20,000 images, 26 major categories, and 821 subcategories. Knowledge learned from land scenes is transferred and fine-tuned on the oceanic data. Detailed comparisons with existing open-vocabulary object detection algorithms and fully supervised algorithms validate the potential of cross-domain data in oceanic tasks and show further gains in the model's accuracy in detecting new-class objects in oceanic scenes.

Finally, this dissertation summarizes the above research, discusses the outlook for object detection, and identifies potential directions for further in-depth research.
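Contribution (1) describes a visual feature classifier guided by text category embeddings, combined with background modeling. The abstract does not give the formulation, so the following is only a minimal sketch under stated assumptions: region features are scored against class text embeddings by cosine similarity, a separate background embedding competes with the classes, and a temperature-scaled softmax produces probabilities. The function names, the temperature value, and the toy three-dimensional vectors are all hypothetical and chosen purely for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def classify_region(region_feature, class_embeddings, background_embedding,
                    temperature=0.07):
    """Score one region feature against text embeddings plus a background slot.

    Assumption: cosine similarity followed by a temperature-scaled softmax.
    This is an illustrative choice, not the dissertation's confirmed method.
    """
    names = list(class_embeddings) + ["background"]
    prototypes = [l2_normalize(e) for e in class_embeddings.values()]
    prototypes.append(l2_normalize(background_embedding))

    feat = l2_normalize(region_feature)
    logits = [sum(f * p for f, p in zip(feat, proto)) / temperature
              for proto in prototypes]

    # Numerically stable softmax over classes and background.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


# Hypothetical toy embeddings; a real system would obtain these from a
# pre-trained text encoder and a detector's region features.
class_embeddings = {"car": [1.0, 0.0, 0.0], "pedestrian": [0.0, 1.0, 0.0]}
background_embedding = [0.0, 0.0, 1.0]

probs = classify_region([0.9, 0.1, 0.2], class_embeddings, background_embedding)
predicted = max(probs, key=probs.get)

bg_probs = classify_region([0.0, 0.1, 0.9], class_embeddings, background_embedding)
bg_predicted = max(bg_probs, key=bg_probs.get)
```

Because the classifier weights in this sketch are text embeddings rather than parameters tied to a fixed label set, adding a category only requires adding its embedding to the dictionary, which is one reason this style of classifier is attractive for open-set settings.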
Keywords/Search Tags: Object detection, cross-domain detection, autonomous driving, marine object detection