
Vision-Based Pose Estimation of Rigid Objects

Posted on: 2024-09-12  Degree: Master  Type: Thesis
Country: China  Candidate: J H Liu  Full Text: PDF
GTID: 2568307055957509  Subject: Communication and Information System
Abstract/Summary:
In applications such as robot navigation, autonomous driving, and virtual/augmented reality, estimating the 6D pose of a target from RGB information is a crucial step. The problem is defined as locating predefined targets in an input image and determining their rotation angles and translation vectors relative to the camera coordinate system. Current solutions fall into two categories: one-stage algorithms and two-stage algorithms. One-stage algorithms directly regress a six-dimensional pose vector from the input image; they follow the end-to-end paradigm and do not rely on any intermediate representation. With the increasing fitting capacity of deep neural networks, researchers have come to rely on this capacity to learn the direct mapping from image to pose. Two-stage algorithms rely on the 3D structure of the target: keypoints are first located in the image, and the pose is then obtained by optimizing the projection equations relating the 2D keypoint coordinates to their 3D coordinates on the object model. In this type of method the keypoints act as an intermediate representation for pose estimation, so the accuracy of keypoint localization largely determines the accuracy of the estimated pose. Research on keypoint localization has itself undergone a transition from image-processing-based solutions to deep-learning-based solutions. This thesis conducts a comprehensive study of both families of methods, and the main work is as follows.

In the two-stage camp, research and improvements are made on the classic two-stage pose algorithm YOLO-6D. To improve keypoint localization accuracy, a semantic segmentation branch is added during training. Semantic segmentation is essentially a pixel-level classification task; it strengthens the backbone's understanding of fine-grained features, which is also essential for keypoint localization. According to the principle of multi-task learning, adding the segmentation task therefore improves the model's ability to locate keypoints. The segmentation branch is removed during inference, so inference speed is not compromised. Experimental results show that the model reaches 94.43% on the 2D projection metric and 67.37% on the ADD metric on the LINEMOD dataset, outperforming classic two-stage algorithms, while running at 73 FPS, which makes it both accurate and practical.

In the one-stage camp, this thesis uses the DETR architecture to recover the pose of the target. DETR applies the Transformer to object detection: it replaces convolution-based dense prediction with the Transformer's global modeling ability and treats detection as a set prediction problem, with the advantages of a simple pipeline, no reliance on prior knowledge, and no post-processing de-duplication. Building on this, the thesis likewise formulates pose estimation as a set prediction problem and develops a concise one-stage algorithm that inherits the above advantages of DETR. In addition, to address DETR's lack of multi-scale training and its slow convergence, a multi-scale deformable attention mechanism is introduced. Experimental results on the YCB-Video dataset show that the proposed algorithm achieves an 83.0% ADD score, higher than other classic single-stage algorithms, with a concise pipeline and fast inference speed.
Finally, research is conducted on estimating the relative attitude of non-cooperative spacecraft in the space environment. During the study of one-stage and two-stage algorithms, it was found that keypoint localization schemes based on a voting mechanism in the two-stage camp generally achieve higher pose accuracy than other two-stage and one-stage algorithms, and are more robust to occlusion and truncation. Therefore, to improve satellite attitude accuracy, a satellite attitude estimation model based on the voting mechanism is developed. Because the keypoints in the satellite scene are relatively dispersed and the receptive field of a convolution-based voting network is limited, such a network cannot model the correlations between keypoints. To address this, a self-attention mechanism is introduced into the voting network, which strengthens its global modeling ability, improves keypoint localization accuracy, and thus improves attitude estimation accuracy. The model achieved a score of 0.012 in the Kelvin Attitude Challenge, ranking third among all algorithms.
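As described above, the two-stage pipelines, including the voting-based satellite model, close by solving the projection equations from 2D-3D keypoint correspondences. Below is a minimal sketch of that PnP step using OpenCV; the keypoint arrays and camera intrinsics are placeholders, not values from the thesis.

```python
# Minimal PnP sketch: recover rotation and translation from 2D-3D keypoint
# correspondences, the step that closes a two-stage pose estimation pipeline.
import cv2
import numpy as np

object_pts = np.random.rand(8, 3).astype(np.float32)         # keypoints on the 3D model (placeholder)
image_pts = np.random.rand(8, 2).astype(np.float32) * 640.0  # their detected 2D locations (placeholder)
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0,   1.0]], dtype=np.float32)           # illustrative camera intrinsics

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)   # 3x3 rotation matrix; tvec is the translation vector
```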
Keywords/Search Tags: deep learning, pose estimation, YOLO, transformer, uncooperative spacecraft