| With the rapid development of artificial intelligence, demand is growing for services such as intelligent robots, autonomous driving, and indoor navigation, which has prompted in-depth research in these areas. These fields share a basic problem: how a device can localize itself more accurately. Convolutional neural networks (CNNs) perform well in camera localization, but they still suffer from low accuracy and high error rates. One important reason is that position and orientation, two different kinds of parameters, are processed in a unified way. This paper proposes two end-to-end deep learning methods that regress the position and orientation of the camera from color images. The main work and contributions are summarized as follows: (1) A dual-stream encoder-decoder localization network (DSEDL-Net) is proposed. The dual-stream structure decouples position and orientation and resolves the interference between the two. Because camera position and orientation have different characteristics, the network applies the multi-task concept and predicts them separately in the two streams, thus obtaining more reliable results. A camera pose regressor is also proposed that uses either a single-scale downsampling module or a multi-scale aggregation module to transform the decoded features, and applies global average pooling to capture the spatial information of the features and reduce information loss. (2) A scene localization network based on joint task learning (JTL-Loc Net) is proposed. DSEDL-Net completely decouples position and orientation, but the two are not entirely independent, so JTL-Loc Net introduces an attention-based gating module that selects and passes on the information each task needs to focus on. This information is also a global feature, which overcomes the locality of convolution operations in convolutional networks and allows information to be shared between tasks. In addition, JTL-Loc Net adds auxiliary task branches to DSEDL-Net, which improves network performance. The auxiliary task branches (such as crop coordinates, rotation angle, or scaling factor) are embedded after the position decoder. For small-scale datasets, the auxiliary tasks can be regarded as a regularization term in the network: they provide prior knowledge by adding constraints that reduce the hypothesis space and accelerate convergence. (3) Extensive experiments on challenging public indoor and outdoor scene datasets demonstrate the effectiveness of the proposed methods. On the indoor Microsoft 7-Scenes dataset, the average position and orientation errors of DSEDL-Net are reduced by 47.7% and 21.5%, respectively, compared with the "Pose Net" method; compared with the "LSTM-Pose" method, the average position and orientation errors of JTL-Loc Net are reduced by 32.3% and 36.5%, respectively. On the outdoor Cambridge Landmarks dataset, the average pose errors of the proposed JTL-Loc Net are reduced by 44% and 64% compared with "Pose Net". In summary, the two proposed networks achieve good results on public indoor and outdoor datasets, demonstrating the feasibility and effectiveness of the proposed methods for multi-view 3D scene localization tasks. |
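
The position and orientation errors quoted above are commonly computed per test image as a Euclidean distance and a rotation angle, and then averaged. The following is a minimal sketch of such an evaluation, not the authors' code. It assumes that orientations are stored as quaternions in (w, x, y, z) order, which the abstract does not state, and all function names are hypothetical.

```python
import math


def position_error(p_pred, p_true):
    """Euclidean distance between predicted and ground-truth positions."""
    return math.dist(p_pred, p_true)


def orientation_error_deg(q_pred, q_true):
    """Angle in degrees between two orientations.

    Assumption: both orientations are quaternions in (w, x, y, z) order.
    They are normalized here, so raw network outputs can be passed in.
    """
    n_pred = math.sqrt(sum(c * c for c in q_pred))
    n_true = math.sqrt(sum(c * c for c in q_true))
    dot = sum(a * b for a, b in zip(q_pred, q_true)) / (n_pred * n_true)
    # q and -q encode the same rotation, so use the absolute value;
    # clamp to 1.0 to guard against rounding slightly above the acos domain.
    dot = min(1.0, abs(dot))
    return math.degrees(2.0 * math.acos(dot))


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    identity = (1.0, 0.0, 0.0, 0.0)
    quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
    print(position_error((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
    print(orientation_error_deg(identity, quarter_turn_z))
    print(relative_reduction(0.40, 0.30))
```

Under these assumptions, a percentage such as the reductions reported against the baselines would presumably come from applying `relative_reduction` to the per-dataset average errors of the baseline and of the proposed network.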