Monocular 3D object detection aims to classify and localize objects in monocular images, and has been widely applied in autonomous driving, intelligent robotics, and virtual reality. However, monocular images lack directly usable depth information, which makes monocular 3D object detection a challenging task. How to make full use of the depth information latent in an image to assist 3D object detection is therefore a meaningful research topic. Besides, because labeling 3D boxes for large-scale real scenes is costly and difficult, existing supervised methods are hard to apply to unlabeled real scenes. Considering the convenience of obtaining 3D box annotations for synthetic data, a model trained on the labeled synthetic domain can be transferred to the unlabeled real domain through unsupervised domain adaptation. In this way, the lack of annotations is effectively addressed, enabling unsupervised 3D object detection in the real domain. To this end, by analyzing the key issues of monocular 3D object detection, this thesis studies deep learning-based monocular 3D object detection.

This thesis first proposes an RGB-D joint monocular 3D object detection method. A depth map estimated from the RGB image serves as an auxiliary input that provides key depth and localization information for the RGB image at both the feature extraction level and the 3D object detection level, so that the position and size of each object can be perceived accurately. Specifically, a consistency-aware joint detection mechanism is proposed to jointly detect objects in the image and exploit the localization information in the depth detection stream to refine the detection results in the RGB stream. Then, to retain more accurate 3D bounding boxes, an orientation-embedded Non-Maximum Suppression (NMS) is designed for the post-processing stage. By introducing an orientation confidence prediction and embedding this confidence into traditional NMS, the localization quality of the
predicted 3D bounding box is measured more comprehensively. Experiments on a widely used dataset demonstrate that the proposed method achieves strong detection performance.

This thesis also presents an unsupervised domain adaptation method for monocular 3D object detection. Because domain shifts exist between the synthetic domain and the real domain, specific domain adaptation strategies are designed to reduce them. Specifically, to reduce the domain shift caused by the perspective projection process, the proposed method utilizes a depth estimation network to generate a depth map and converts it into a pseudo point cloud for 3D object detection. Meanwhile, to reduce the domain shift in image style, a style transfer module is introduced in the depth estimation stage to obtain high-quality depth maps for pseudo point cloud generation. Furthermore, a cross-domain point cloud size matching module is introduced in the point-based 3D object detection stage, which adjusts the size of synthetic point clouds to match real object sizes, thereby improving detection performance in the real domain. Experiments on the synthetic dataset and the real dataset demonstrate that the proposed method achieves superior performance on cross-domain 3D object detection.
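The abstract does not specify how the orientation confidence is embedded into NMS; a minimal sketch, assuming a multiplicative fusion where the classification score is re-weighted by the predicted orientation confidence before standard greedy suppression (all names hypothetical, and axis-aligned 2D IoU is used here for brevity, whereas a 3D detector would typically use bird's-eye-view or 3D IoU):

```python
import numpy as np

def oriented_nms(boxes, cls_scores, ori_confs, iou_thresh=0.5):
    """Greedy NMS ranked by cls_score * orientation confidence.

    boxes:      (N, 4) axis-aligned [x1, y1, x2, y2] (illustrative).
    cls_scores: (N,) classification confidences.
    ori_confs:  (N,) predicted orientation confidences.
    Returns indices of kept boxes, highest fused score first.
    """
    # Embed the orientation confidence into the ranking score.
    scores = cls_scores * ori_confs
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        # IoU of the current top box against the remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```

With this fusion, a box whose orientation is predicted unreliably can be out-ranked by an overlapping box with a slightly lower classification score but a more trustworthy orientation, which is the effect the orientation-embedded NMS is designed to achieve.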
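The conversion of a depth map into a pseudo point cloud is standard pinhole back-projection; a minimal sketch, assuming known camera intrinsics (fx, fy, cx, cy) and a dense depth map, with the function name being illustrative rather than the thesis's own:

```python
import numpy as np

def depth_to_pseudo_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (H, W) into camera-frame 3D points.

    For each pixel (u, v) with depth z, the pinhole model gives
        x = (u - cx) * z / fx,   y = (v - cy) * z / fy.
    Returns an (H*W, 3) array; in practice invalid-depth pixels
    would be filtered out before feeding a point-based detector.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Because the projection geometry is shared by synthetic and real cameras (given correct intrinsics), operating on the pseudo point cloud rather than raw pixels is what lets the method sidestep the perspective-projection component of the domain shift.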