| 6D pose estimation has a wide range of applications in the military, autonomous driving, and industrial robotics. Because monocular RGB cameras are inexpensive, flexible, and convenient to deploy, object pose estimation methods based on RGB images have been widely studied. Deep-learning-based methods can be roughly divided into direct methods and 2D-3D correspondence methods. Direct methods can be trained end to end on the 6D pose, but their accuracy is usually limited by the lack of geometric guidance. 2D-3D correspondence methods use correspondences between the RGB image and the CAD model to improve accuracy, and can be further divided into sparse-based and dense-based methods. Dense-based methods use a CNN to predict a large number of 2D-3D correspondences, which yields high accuracy and robustness, but their parameter counts and computational overhead are relatively large. Sparse-based networks are relatively lightweight, but their accuracy and robustness are limited under occlusion and truncation. These methods therefore still face challenges. This paper addresses the difficulties of deep-learning-based object pose estimation from RGB images. The main contributions are as follows: 1) A Data Field Weighting Based Pixel-wise Voting Network for Effective 6D Pose Estimation (DFW-PVNet) is proposed, which combines the advantages of sparse-based and dense-based methods. First, DFW-PVNet segments the object mask in the RGB image and predicts, for every pixel in the mask, a unit vector pointing to each 2D keypoint; this unit vector field is then used to vote for the 2D keypoints. Based on data field theory, the influence of pixels at different positions is modeled as a potential weight, and only pixels with higher potential weight are selected to participate in locating the 2D keypoints, which reduces the error introduced by faraway pixels. The experimental results demonstrate that the ADD(-S) metric of DFW-PVNet on
the LINEMOD and Occlusion LINEMOD datasets reaches 90.06% and 46.93%, respectively, which is superior to other sparse-based and direct methods and comparable to the state-of-the-art dense-based method GDR-Net. In addition, the parameter count, floating-point operations, and memory reads and writes of DFW-PVNet are about one-tenth to one-fifth of those of GDR-Net, which makes it better suited to deployment on devices with constrained computing and power resources. 2) An end-to-end pose estimation method based on dual-channel feature fusion is proposed. This method combines the advantages of sparse-based and direct methods through a purpose-built Dual-channel Feature Fusion Network for End-to-End 6D Pose Estimation (DFFNet). DFFNet contains two channels: one extracts the implicit pose features within each 2D-3D correspondence, and the other extracts the implicit topological features between the 2D-3D correspondences. The features from the two channels are then fused, and the 6D pose is finally predicted by an MLP module. DFFNet is a generic module that can be combined with existing sparse-based methods to achieve end-to-end joint training while avoiding the redundant PnP-RANSAC process. Experiments on the Synthetic Sphere dataset show that, as the number of outliers and the noise in the 2D-3D correspondences increase, DFFNet is more accurate and robust than PnP-RANSAC and the baseline Single-Stage. On the Occlusion LINEMOD dataset, DFFNet reaches an ADD(-S) of 44.92%, which is 1.62 percentage points higher than Single-Stage. Building on this research, this paper designs and implements a 6D pose estimation system with which users can conveniently and effectively estimate the 6D pose of an object. |
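
The idea behind the data-field-weighted voting in contribution 1) can be illustrated with a small sketch. It is not the implementation from the paper: the abstract does not give the potential function or the selection rule, so the Gaussian potential `exp(-(d / sigma)^2)` commonly used in data field theory, the reference point (a coarse estimate from all pixels), and the parameters `sigma` and `keep_ratio` are all assumptions. For brevity, a weighted least-squares line intersection stands in for the hypothesis-based voting a PVNet-style network would use, and the function names are hypothetical.

```python
import math


def intersect_lines(pixels, directions, weights):
    """Weighted least-squares point closest to the 2D lines p_i + t * d_i."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), (dx, dy), w in zip(pixels, directions, weights):
        # I - d d^T projects onto the normal of the line (d has unit length).
        m11, m12, m22 = 1.0 - dx * dx, -dx * dy, 1.0 - dy * dy
        a11 += w * m11
        a12 += w * m12
        a22 += w * m22
        b1 += w * (m11 * px + m12 * py)
        b2 += w * (m12 * px + m22 * py)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("lines are (nearly) parallel or too few were given")
    # Solve the symmetric 2x2 system with Cramer's rule.
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


def potential(pixel, center, sigma):
    """Assumed Gaussian data-field potential of a pixel around `center`."""
    dist = math.hypot(pixel[0] - center[0], pixel[1] - center[1])
    return math.exp(-((dist / sigma) ** 2))


def locate_keypoint(pixels, directions, sigma=20.0, keep_ratio=0.5):
    """Estimate one 2D keypoint from per-pixel unit vectors (sketch only)."""
    # Step 1: coarse estimate in which every mask pixel counts equally.
    coarse = intersect_lines(pixels, directions, [1.0] * len(pixels))
    # Step 2: potential weight of each pixel relative to the coarse estimate.
    weights = [potential(p, coarse, sigma) for p in pixels]
    if max(weights) == 0.0:
        return coarse  # every weight underflowed; keep the coarse estimate
    # Step 3: keep only the pixels with the highest potential weight.
    order = sorted(range(len(pixels)), key=lambda i: weights[i], reverse=True)
    keep = order[:max(2, int(len(pixels) * keep_ratio))]
    # Step 4: re-estimate the keypoint from the selected, weighted pixels.
    return intersect_lines(
        [pixels[i] for i in keep],
        [directions[i] for i in keep],
        [weights[i] for i in keep],
    )
```

In this sketch a small angular error on a distant pixel shifts its line by roughly the distance times the sine of that error, which is why discarding low-potential pixels helps. A network such as the one described would supply `pixels` and `directions` from its predicted mask and unit vector field, one field per keypoint.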