
End-to-End Dense Stereo Matching Based on a Fully Convolutional Neural Network

Posted on: 2021-06-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J H Kang
GTID: 1488306461964099
Subject: Photogrammetry and Remote Sensing
Abstract/Summary:
Stereo vision is an important branch of computer vision. By imitating the way human vision perceives three-dimensional information, it can quickly reconstruct the 3D depth of a scene from two or more two-dimensional images. Binocular stereo matching is a key technology in stereo vision: a binocular camera captures left and right images of the same scene, correspondences between the two images are found, and the resulting pixel-wise disparity yields the depth of the 3D scene. Because it senses depth passively, binocular stereo vision is low in cost, simple in structure, and easy to implement, and it is widely used in frontier fields such as autonomous driving, autonomous robot navigation, virtual reality, and 3D reconstruction. It therefore has considerable commercial and practical value.

In recent years, dense stereo matching has been studied extensively and intensively. However, several kinds of regions remain challenging. In weakly textured or repetitively textured regions, stereo matching is ambiguous: a pixel in the left image may have multiple candidate matches in the right image. In depth-discontinuous areas, the disparity continuity assumption of traditional matching methods causes over-smoothing. Disparity discontinuities are also accompanied by occluded regions, where pixels in the left image are not visible in the right image, which makes these regions more prone to mismatches than others. Reducing mismatches in these challenging regions is therefore a problem that current stereo matching methods urgently need to solve.

With the release of public stereo matching datasets such as Middlebury and KITTI, and with the rapid improvement of computer hardware, many researchers have introduced deep learning techniques into dense stereo matching. The convolutional neural network (CNN) has demonstrated powerful feature extraction and model expression capabilities for images and can be applied to various vision tasks such as semantic segmentation, image classification, and image recognition. Since dense stereo matching can also be regarded as a pixel-level vision task, this dissertation investigates CNN-based dense stereo matching by transforming the traditional stereo matching problem into a disparity learning and optimization problem. The main contributions are as follows:

(1) We propose an end-to-end disparity learning network based on the CNN that follows the typical steps of traditional stereo matching. Traditional methods usually follow a four-step pipeline: matching cost calculation, matching cost aggregation, disparity estimation, and disparity refinement. Our end-to-end network recasts this pipeline as a CNN-based disparity learning and optimization problem and consists of four modules: feature extraction, cost volume construction, disparity estimation, and disparity refinement. These four modules correspond to the four classical steps, respectively. Because of its powerful feature extraction and expression capability, the CNN can replace the shallow representation of manually designed features and learn features and optimize their representation automatically from data, which provides an interpretable theoretical foundation for deep-learning-based disparity learning. The experimental results show that the proposed end-to-end network predicts disparity maps quickly and accurately and performs significantly better than traditional stereo matching methods and other deep learning methods.

(2) We propose a multi-scale feature extraction module based on dilated convolution. To improve matching in low-texture and repetitive-texture regions, we introduce dilated convolution into feature extraction. Compared with standard convolution, dilated convolution enlarges the receptive field without increasing the number of learnable parameters. We use several parallel dilated convolution layers to obtain multi-scale features, which provide rich contextual information for accurate disparity estimation. The experimental results demonstrate that dilated convolution does improve disparity accuracy, especially in low-texture and repetitive-texture areas.

(3) We construct a cost volume from the two feature maps over a large disparity range. The left and right image feature maps are used to build a three-dimensional cost volume that captures the correspondence between the left and right images in feature space and provides prior information for the subsequent disparity estimation. When constructing the 3D cost volume, we increase the stride used in computing the convolution between the left feature map and the shifted right feature map, which enables the network to handle a larger disparity range. The experimental results show that, with this modification of the cost volume module, our network can predict disparity over a larger range.

(4) We employ a gradient regularizer to preserve structural details in areas with strong depth discontinuities. In the disparity estimation module, an encoder-decoder structure recovers the multi-scale pixel-level disparity map from coarse to fine, combining low-level fine information with high-level coarse information through skip connections. We introduce disparity gradient information into the overall loss function, so that the network not only regresses the disparity value but also learns how disparity changes, which regularizes the areas with strong depth discontinuities. The experimental results show that the disparity maps predicted with the gradient regularizer in the loss function retain sharp disparity edges and avoid over-smoothing in depth-discontinuous regions.

(5) We propose a disparity refinement subnetwork that refines and optimizes the initial disparity under the guidance of geometric constraints. First, using the left and right feature maps shared from the initial network, a residual cost volume is constructed within the disparity residual range to provide more detailed correspondence information between the left and right images. Second, based on the initial disparity map, a reconstruction error volume is constructed between the left feature map and the warped right feature map, which reflects how correct the initial disparity is. Finally, taking these two volumes as input, a shallow encoder-decoder structure learns disparity residual maps at different scales. The final refined disparity is obtained by adding the learned residual map to the initial disparity map. The results show that the proposed refinement network improves the accuracy of the initial disparity map, corrects regions where the initial disparity is wrong, and produces sub-pixel disparity maps.

(6) We propose a method for evaluating the generalization ability of the proposed stereo matching network based on different transfer learning strategies. To evaluate how well the network generalizes to different datasets, we employ transductive transfer learning and fine-tuning strategies to generalize our model to a realistic street-view dataset and an aerial image dataset, respectively. The generalization results show that the proposed end-to-end network transfers well to other application scenarios, obtains more accurate results than traditional methods, and has strong generalization ability.
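
As an illustration of the cost volume construction described in contribution (3), the following is a minimal NumPy sketch and not the dissertation's implementation. It assumes a correlation-style cost (a channel-wise dot product between the left feature map and the right feature map shifted by each candidate disparity) and interprets the "stride" as the step between candidate disparities. The function name `correlation_cost_volume` and the parameters `max_disp` and `disp_stride` are hypothetical names chosen for this example.

```python
import numpy as np


def correlation_cost_volume(left_feat, right_feat, max_disp, disp_stride=1):
    """Correlate left features with right features shifted by each candidate disparity.

    left_feat, right_feat: arrays of shape (C, H, W).
    Returns (volume, disparities), where volume has shape (D, H, W) and
    disparities lists the D candidate shifts 0, disp_stride, 2*disp_stride, ...
    """
    channels, height, width = left_feat.shape
    if max_disp >= width:
        raise ValueError("max_disp must be smaller than the feature map width")

    disparities = np.arange(0, max_disp + 1, disp_stride)
    volume = np.zeros((len(disparities), height, width), dtype=left_feat.dtype)
    for i, d in enumerate(disparities):
        # Left pixel at column x is compared with right pixel at column x - d.
        # Columns x < d have no valid partner and keep a cost of zero.
        volume[i, :, d:] = (left_feat[:, :, d:] * right_feat[:, :, :width - d]).sum(axis=0)
    return volume, disparities


# Synthetic check (assumed data): the left features are the right features
# shifted by a known disparity of 4 columns.
rng = np.random.default_rng(0)
right = rng.standard_normal((8, 6, 32))
right /= np.linalg.norm(right, axis=0, keepdims=True)  # unit-length feature vectors
left = np.roll(right, 4, axis=2)

volume, disparities = correlation_cost_volume(left, right, max_disp=8, disp_stride=2)
best = disparities[volume.argmax(axis=0)]  # winner-take-all disparity per pixel
```

With `disp_stride=2` the volume holds only five slices (disparities 0, 2, 4, 6, 8) instead of nine, which hints at why a larger step lets the same memory budget cover a wider disparity range, at the cost of coarser candidate sampling. In a learned network the winner-take-all step would be replaced by the disparity estimation module.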
Keywords/Search Tags: Stereo matching, CNN, Dilated convolution, Gradient regularizer, Disparity refinement, Generalization evaluation