
Deep Learning Based 3D Perception And Recognition For Robot Vision

Posted on: 2020-06-19    Degree: Doctor    Type: Dissertation
Country: China    Candidate: M T Feng    Full Text: PDF
GTID: 1368330626456893    Subject: Circuits and Systems
Abstract/Summary:
In recent years, with the rapid development of artificial intelligence and 5G communication technology, the intelligent development of robots has encountered new opportunities. How robots can perceive and understand the 3D world through eyes, brains and limbs like human beings, and make decisions based on the acquired information, has become a research hotspot in both academia and industry. Deep learning methods have achieved great success in image processing, computer vision and pattern recognition, so it is of great significance to study deep learning based 3D visual perception and recognition methods for robots. This thesis focuses on two core classes of algorithms, i.e., pixel-level prediction deep convolutional networks and point cloud deep convolutional networks, and carries out related work on human-computer interaction and complex scene understanding for robots. The main research contents of this thesis can be summarized as follows:

1. The mobile robot platform senses the 3D world through a variety of visual sensors mounted on it, including video cameras, binocular cameras, RGB-D depth cameras, light field cameras and LiDAR. By comparing and analyzing the imaging characteristics of these depth cameras, we choose the light field camera, which offers a novel imaging mode: it records the direction of light propagation while acquiring light intensity and performs well both indoors and outdoors, providing a new option for mobile robot vision systems. However, unlike traditional imaging systems, the CMOS imaging sensor of a light field camera is covered by a micro-lens array, so we design a dedicated procedure to calibrate it. For LiDAR, we mainly use the captured point clouds to record indoor 3D scenes and understand them through the methods proposed below.

2. An important part of human-computer interaction is the reconstruction and understanding of faces. Reconstructing 3D facial geometry from a single RGB image has recently attracted wide research interest. However, it is still an ill-posed problem, and most methods rely on prior models, which undermines the accuracy of the recovered 3D faces. In this work, we exploit the Epipolar Plane Images (EPIs) obtained from light field cameras and learn CNN models that recover horizontal and vertical 3D facial curves from the respective horizontal and vertical EPIs. Our 3D face reconstruction network (FaceLFnet) uses a densely connected architecture to learn accurate 3D facial curves from low-resolution EPIs. To train the proposed FaceLFnets from scratch, we synthesize photo-realistic light field images from 3D facial scans. The curve-by-curve 3D face estimation approach allows the networks to learn from only 14K images of 80 identities, which still comprise over 11 million EPIs/curves. The estimated facial curves are merged into a single point cloud, to which a surface is fitted to obtain the final 3D face. Our method is model-free, requires only a few training samples to learn FaceLFnet, and can reconstruct 3D faces with high accuracy from single light field images under varying poses, expressions and lighting conditions. Comparisons on the BU-3DFE and BU-4DFE datasets show that our method reduces reconstruction errors by over 20% compared to the recent state of the art.
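For reference, an Epipolar Plane Image is simply a 2D slice of the 4D light field in which one angular and one spatial coordinate are held fixed; the slope of the line traced by a scene point in such a slice encodes its depth. The snippet below is a minimal numpy illustration of how horizontal and vertical EPIs could be extracted before being fed to networks such as FaceLFnet; the L[v, u, t, s] array layout and the toy grayscale data are assumptions, not the thesis code.

```python
import numpy as np

def horizontal_epi(lf, v_fixed, t_fixed):
    """Horizontal EPI: slice of the 4D light field L[v, u, t, s] with the
    angular row v and spatial row t held fixed; remaining axes are (u, s)."""
    return lf[v_fixed, :, t_fixed, :]

def vertical_epi(lf, u_fixed, s_fixed):
    """Vertical EPI: angular column u and spatial column s held fixed; axes (v, t)."""
    return lf[:, u_fixed, :, s_fixed]

# Toy example: a 9x9 angular grid of 128x128 grayscale sub-aperture views.
lf = np.random.rand(9, 9, 128, 128).astype(np.float32)
h_epi = horizontal_epi(lf, v_fixed=4, t_fixed=64)   # shape (9, 128)
v_epi = vertical_epi(lf, u_fixed=4, s_fixed=64)     # shape (9, 128)
print(h_epi.shape, v_epi.shape)
```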
3. Predicting depth maps from RGB images can help robots avoid obstacles and plan paths. Convolutional Neural Networks (CNNs) have performed extremely well for many image analysis tasks. However, supervised training of deep CNN architectures requires huge amounts of labelled data, which is unavailable for light field images. In this work, we leverage synthetic light field images and propose a two-stream CNN that learns to estimate the disparities of multiple correlated neighbourhood pixels from their Epipolar Plane Images (EPIs). Since the EPIs are unrelated except at their intersection, the two-stream network learns convolution weights individually for each kind of EPI and then combines the outputs of the two streams for disparity estimation. The CNN-estimated disparity map is then refined using the central RGB light field image as a prior in a variational technique. We also propose a new real-world dataset comprising light field images of 19 objects captured outdoors with a Lytro Illum camera, together with their corresponding 3D point clouds captured with a 3dMD scanner as ground truth. This dataset will be made public to enable a more precise, point-cloud-level comparison of algorithms, which is currently not possible. Experiments on the synthetic and real-world datasets show that our algorithm outperforms the existing state of the art for depth estimation from light field images.

4. Semantic segmentation of 3D point clouds is the key for mobile robots to understand complex scenes. CNNs have performed extremely well on data represented by regularly arranged grids, such as images. However, directly applying classic convolution kernels or parameter-sharing mechanisms to sparse 3D point clouds is inefficient due to their irregular and unordered nature. We propose a point attention network that learns rich local shape features and their contextual correlations for 3D point cloud semantic segmentation. Since the geometric distribution of the neighboring points is invariant to point ordering, we propose a Local Attention-Edge Convolution (LAE-Conv) that constructs a local graph from neighborhood points searched in multiple directions, assigns an attention coefficient to each edge, and aggregates the point features as a weighted sum over the neighbors (a sketch follows below). The learned LAE-Conv features are then passed to a point-wise spatial attention module that generates an interdependency matrix over all points regardless of their distances, capturing long-range spatial context that contributes to more precise semantic information. The proposed point attention network consists of an encoder and a decoder which, together with the LAE-Conv layers and the point-wise spatial attention modules, form an end-to-end trainable network that predicts dense labels for 3D point cloud segmentation. Experiments on challenging 3D point cloud benchmarks show that our algorithm performs on par with or better (by about 1.2%) than existing state-of-the-art methods.
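To make the attention-weighted aggregation step concrete, the following is a minimal sketch in the spirit of LAE-Conv, assuming a PyTorch setting: each point gathers features from k precomputed neighbors, a small MLP predicts an attention coefficient per edge, and the output is the attention-weighted sum of transformed neighbor features. The layer widths, MLPs and neighbor indexing are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class EdgeAttentionConv(nn.Module):
    """Sketch of attention-weighted neighbor aggregation over a local point graph."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.feat_mlp = nn.Linear(in_dim, out_dim)            # per-neighbor feature transform
        self.attn_mlp = nn.Sequential(                        # scalar attention per edge
            nn.Linear(in_dim + 3, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, xyz, feats, neighbor_idx):
        # xyz: (N, 3) coordinates, feats: (N, C) features, neighbor_idx: (N, k) indices
        nbr_feats = feats[neighbor_idx]                        # (N, k, C)
        rel_xyz = xyz[neighbor_idx] - xyz.unsqueeze(1)         # (N, k, 3) relative positions
        edge_in = torch.cat([nbr_feats, rel_xyz], dim=-1)      # (N, k, C+3)
        attn = torch.softmax(self.attn_mlp(edge_in), dim=1)    # (N, k, 1), weights sum to 1
        return (attn * self.feat_mlp(nbr_feats)).sum(dim=1)    # (N, out_dim) weighted sum

# Toy usage: 1024 points, 32-dim features, 16 neighbors per point (random indices here).
xyz, feats = torch.rand(1024, 3), torch.rand(1024, 32)
neighbor_idx = torch.randint(0, 1024, (1024, 16))
layer = EdgeAttentionConv(32, 64)
print(layer(xyz, feats, neighbor_idx).shape)  # torch.Size([1024, 64])
```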
5. In order to execute tasks such as grasping, a mobile robot first needs to detect objects in the 3D point cloud scene. CNNs have emerged as a powerful strategy for most object detection tasks on 2D images. However, their power has not been fully exploited for detecting 3D objects directly from point clouds without first converting them to regular grids. Moreover, existing state-of-the-art 3D object detection methods recognize 3D objects individually, without exploiting their relations during learning and inference. In this work, we first introduce a strategy that associates the predictions of direction vectors and pseudo centers to yield a win-win solution for regressing 3D bounding box candidates. We then propose a point attention pooling method that extracts uniform appearance features for each 3D proposal, benefiting from the learned direction features, semantic features and spatial coordinates of the object surface points. The appearance features are used together with position features to build 3D object-object relation graphs over all proposals simultaneously, allowing their interactions to be modeled. Specifically, we explore the effect of the relation graphs on enhancing the proposals' appearance features under both unsupervised and supervised conditions. The proposed relation graph network consists of a 3D object proposal generation module and a 3D relation module, making it an end-to-end trainable network for detecting 3D objects in point clouds. Experiments on challenging 3D point cloud benchmarks (the SUN RGB-D and ScanNet datasets) show that our algorithm performs better (by about 1.5%) than existing state-of-the-art methods.
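The object-object relation graph idea can be illustrated with a minimal sketch: pairwise weights are computed from appearance affinity and relative 3D proposal centers, and each proposal's appearance feature is enhanced by a weighted sum over all other proposals. All dimensions and the position encoding below are assumptions made for illustration, not the thesis implementation.

```python
import torch
import torch.nn as nn

class ProposalRelationModule(nn.Module):
    """Sketch of a relation graph over 3D proposals combining appearance and position cues."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.pos_mlp = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, appearance, centers):
        # appearance: (M, C) proposal features, centers: (M, 3) proposal box centers
        q, k, v = self.query(appearance), self.key(appearance), self.value(appearance)
        app_affinity = q @ k.t() / q.shape[-1] ** 0.5           # (M, M) appearance term
        rel_pos = centers.unsqueeze(1) - centers.unsqueeze(0)   # (M, M, 3) relative positions
        pos_bias = self.pos_mlp(rel_pos).squeeze(-1)            # (M, M) geometric term
        relation = torch.softmax(app_affinity + pos_bias, dim=-1)
        return appearance + relation @ v                        # residual feature enhancement

# Toy usage: 64 proposals with 128-dim appearance features.
module = ProposalRelationModule(128)
enhanced = module(torch.rand(64, 128), torch.rand(64, 3))
print(enhanced.shape)  # torch.Size([64, 128])
```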
Keywords/Search Tags: Robots, Deep Learning, 3D Vision, Light Field Cameras, Depth Estimation, 3D Face Reconstruction, 3D Semantic Segmentation, 3D Object Detection