| Object detection and localization is a classic and fundamental problem in computer vision. Since the 1970s, it has attracted sustained interest from researchers. As an active topic in both academia and industry, object localization uses core artificial intelligence technology to provide strong support for advanced applications such as facial recognition, intelligent surveillance, and autonomous driving. Object localization aims to extract and recognize the objects of interest in images. However, most existing methods are based on fully supervised learning, which means that models are trained using predefined object categories and bounding boxes. This process requires a large amount of fine-grained data annotation, which severely limits the applicability of object localization. To alleviate the reliance on bounding-box annotations, weakly supervised object localization methods have emerged. Unlike fully supervised methods, weakly supervised methods require only coarse image-level annotations, which significantly reduces the dependence on manual annotation. To address the resulting issue of insufficient supervision, this thesis systematically studies weakly supervised object localization at two levels of granularity: region-level supervision and pixel-level supervision. It progresses from using only image-level labels to incorporating region-level supervision, then introducing pixel-level pseudo-labels, and finally incorporating high-quality pixel-level pseudo-labels. For region-level supervision, existing methods struggle to locate non-salient object regions and often improve localization performance at the cost of classification performance. For pixel-level supervision, current methods lack pixel-level semantic annotations and generate pixel-level pseudo-labels with ambiguous activation values. This thesis conducts in-depth research on these core challenges, with the main innovations and contributions summarized as 
follows: · A novel diverse robust part mining algorithm is proposed for weakly supervised object localization at region-level supervision granularity. To deal with the challenge of locating non-salient object regions, three mechanisms are designed to model diverse and robust object parts: diversity, compactness, and importance learning. The diversity mechanism encourages the model to focus on multiple salient regions so as to activate non-salient regions; the compactness mechanism learns robust feature representations to obtain more powerful part detectors; and the importance mechanism models the importance of object parts, so that part-aware classification and localization results can be adaptively fused into the final results. Extensive experimental results on two datasets demonstrate that the proposed algorithm can activate non-salient regions and improve the completeness of object localization. · An algorithm based on foreground activation maps is proposed for weakly supervised object localization at region-level supervision granularity. To address the challenge of classification performance degradation caused by improved localization performance, this algorithm models the classification and localization tasks jointly through two core modules: the object-aware attention module and the part-aware attention module. Unlike the aforementioned diverse robust part mining algorithm, which often leads to a decline in classification performance while improving localization performance, the proposed object-aware attention module generates class-agnostic foreground maps to achieve complete object localization, and the part-aware attention module selects the most discriminative parts for accurate object classification, achieving strong performance on both tasks. Results on two standard datasets show that the proposed algorithm achieves significantly better localization and classification accuracy than 
most existing methods. · An algorithm based on adversarial transformers is designed for weakly supervised object localization at pixel-level supervision granularity. To address the challenge of lacking pixel-level semantic annotations, this algorithm utilizes self-supervised learning to provide pixel-level pseudo-labels. To guide the learning of the localization model effectively, two core modules are proposed: an object transformer and a part transformer. The object transformer is designed to generate localization maps for the input images, while the part transformer is designed to discriminate accurately between localization maps and pseudo-labels. The two modules are trained in an adversarial manner to obtain a well-learned localization model. Experimental results on two datasets show that this algorithm can significantly improve the accuracy of object localization. · An algorithm based on a task-aware transformer is proposed for weakly supervised object localization at pixel-level supervision granularity. To deal with the challenge of pixel-level pseudo-labels with ambiguous activation values, three core modules are designed: a representation encoder, a localization decoder, and a classification decoder. The representation encoder models the global context to learn robust features that effectively represent object appearance. The localization decoder adopts an optimal transport algorithm to generate pixel-level labels with binary values, which are used to refine localization results online. The classification decoder uses class-aware prototypes to enhance class-discriminative features for accurate object classification. Extensive experimental results on two datasets show that the proposed algorithm can significantly improve localization and classification performance. Overall, this thesis systematically studies weakly supervised object localization at both region-level and pixel-level supervision. From the perspective of 
research methods, to deal with the challenges of locating non-salient object regions and of classification performance degradation caused by improved localization performance, the research mines diverse and robust object parts and models the classification and localization tasks jointly; to deal with the challenges of lacking pixel-level semantic annotations and of pixel-level pseudo-labels with ambiguous activation values, this thesis uses self-supervised learning to provide pixel-level pseudo-labels and adopts an optimal transport algorithm to generate pixel-level labels with binary values, improving the quality of the pseudo-labels. |
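
The abstract says only that an optimal transport algorithm turns ambiguous activation values into binary pixel-level labels; it does not give the formulation. As a rough illustration of how such a step could work, the sketch below applies entropy-regularized Sinkhorn iterations to a flat list of per-pixel foreground scores. The function name `sinkhorn_binary_labels`, the cost definition, the assumed foreground fraction `fg_ratio`, and the regularization weight `eps` are all assumptions made for this example and are not taken from the thesis.

```python
import math

def sinkhorn_binary_labels(scores, fg_ratio=0.3, eps=0.05, n_iters=200):
    """Assign each pixel to background (0) or foreground (1).

    Hypothetical setup: scores lie in [0, 1], every pixel carries equal mass,
    and a fraction `fg_ratio` of the total mass must end up as foreground.
    """
    n = len(scores)
    # Gibbs kernel exp(-cost / eps); cost is `s` for background, `1 - s` for foreground.
    kernel = [(math.exp(-s / eps), math.exp(-(1.0 - s) / eps)) for s in scores]
    row_mass = 1.0 / n
    col_mass = (1.0 - fg_ratio, fg_ratio)
    v = [1.0, 1.0]
    u = [1.0] * n
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass / (k[0] * v[0] + k[1] * v[1]) for k in kernel]
        v = [col_mass[j] / sum(u[i] * kernel[i][j] for i in range(n))
             for j in range(2)]
    # Transport plan: plan[i][j] is the mass of pixel i sent to class j.
    plan = [(u[i] * kernel[i][0] * v[0], u[i] * kernel[i][1] * v[1])
            for i in range(n)]
    # Binarize by taking the class that receives more of each pixel's mass.
    labels = [1 if p[1] > p[0] else 0 for p in plan]
    return labels, plan

# Example: ten pixels, three of which have clearly high activation.
scores = [0.9, 0.8, 0.1, 0.2, 0.15, 0.05, 0.3, 0.25, 0.1, 0.95]
labels, plan = sinkhorn_binary_labels(scores, fg_ratio=0.3)
print(labels)
```

In this sketch the column marginal acts as a prior on object size, so the binarization threshold adapts to the score distribution rather than being fixed at 0.5. Whether the thesis uses a comparable constraint is not stated in the abstract.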