Simultaneous localization and mapping (SLAM) is an important research topic in robotics: a robot carrying cameras or lidar sensors must localize itself in an unknown environment while simultaneously building a map of it. In recent years, driven by augmented reality and autonomous driving applications, visual SLAM has attracted extensive attention. Visual SLAM uses images as the main source of perceptual information, estimating the camera pose and reconstructing the 3D scene through multi-view geometry. Visual odometry is the core component of visual SLAM and the foundation of self-localization and mapping. Recent work has shown that unlabeled monocular video can be used to train convolutional neural networks for depth prediction and ego-motion estimation. However, for lack of appropriate constraints, the scale of the network output is inconsistent across samples; because of this per-frame scale ambiguity, the ego-motion network cannot produce a complete, consistent camera trajectory over a long video sequence.

In this paper, we design an end-to-end visual odometry network based on deep learning, consisting of a camera ego-motion estimation network and an image depth prediction network. The underlying principle is view synthesis: one image is warped into another using the predicted depth and ego-motion, and the image reconstruction loss serves as the supervision signal, so the network can be trained on monocular video alone; in that setting training is fully unsupervised. In traditional visual odometry, loop closure detected in the video itself is often used to correct the ego-motion estimates, but with deep learning methods loop detection over long videos becomes difficult. We therefore propose forming loops among nearby frames (within 10 frames) and adding the mismatch between the composed ego-motion and the loop constraint as an extra supervision signal during training. This idea resolves the inconsistency between the scale of the depth prediction network and that of the ego-motion estimation network.

Experiments on the KITTI dataset show that adding local loop detection greatly improves the accuracy of ego-motion estimation, and our visual odometry is competitive with recent models trained on stereo video. Finally, inspired by the ability of CNNs to extract relative depth information from monocular images, we propose an object size estimation network that predicts the real-world size of objects in a monocular image, yielding a correspondence between metric scale and image pixels. Comparing these predictions with the previously trained depth prediction network gives the ratio between the predicted depth and the true scale, and applying this ratio to the estimated ego-motion yields a metrically scaled camera trajectory from a monocular video sequence.
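As a concrete illustration of the view-synthesis supervision described above, the standard monocular formulation (not necessarily the authors' exact loss) can be written as follows, assuming known camera intrinsics $K$ and bilinear sampling $\langle\cdot\rangle$ in the source frame:

\[
\hat{p}_s \sim K\, \hat{T}_{t\to s}\, \hat{D}_t(p_t)\, K^{-1} p_t,
\qquad
\mathcal{L}_{\text{photo}} = \sum_{p_t} \left| I_t(p_t) - I_s\!\left\langle \hat{p}_s \right\rangle \right|,
\]

where $\hat{D}_t$ is the predicted depth of the target frame $I_t$ and $\hat{T}_{t\to s}$ is the predicted camera motion from the target frame to the source frame $I_s$.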
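The local loop-closure supervision can be sketched as a consistency loss between the ego-motion predicted directly across a short loop and the composition of the intermediate single-step predictions. The following is a minimal, hypothetical sketch; the function name pose_net, its assumed convention (returning a 4x4 transform taking frame a's coordinates into frame b's), and the window size k are illustrative assumptions, not the authors' implementation.

import torch

def loop_consistency_loss(pose_net, frames, k=3):
    # Hypothetical sketch: for each loop of length k (k < 10), compare the
    # direct prediction T_{i -> i+k} against the chained single-step
    # predictions; the closed loop should be (close to) the identity.
    loss = 0.0
    for i in range(len(frames) - k):
        # Direct prediction over the loop's long edge.
        T_direct = pose_net(frames[i], frames[i + k])
        # Chain the k intermediate single-step predictions.
        T_chain = torch.eye(4, device=T_direct.device)
        for j in range(i, i + k):
            T_chain = pose_net(frames[j], frames[j + 1]) @ T_chain
        # Mismatch between the two routes around the loop.
        residual = torch.linalg.inv(T_chain) @ T_direct - torch.eye(4, device=T_direct.device)
        loss = loss + residual.abs().mean()
    return loss / max(len(frames) - k, 1)

In practice this term would be added to the photometric reconstruction loss with a weighting factor chosen on a validation split.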
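The scale-recovery step can likewise be sketched under simple assumptions: the object size network predicts metric sizes for detected objects, the depth network implies sizes in its own units, and the median of their ratio rescales the trajectory. The function and argument names below are hypothetical.

import numpy as np

def metric_scale_factor(pred_sizes_m, depth_implied_sizes):
    # Ratio between real metric scale and the depth network's internal scale,
    # aggregated robustly (median) over detected objects.
    ratios = np.asarray(pred_sizes_m) / np.asarray(depth_implied_sizes)
    return np.median(ratios)

def rescale_trajectory(poses, scale):
    # Apply the recovered scale to the translation part of each 4x4 pose.
    scaled = []
    for T in poses:
        T = T.copy()
        T[:3, 3] *= scale
        scaled.append(T)
    return scaled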