Stereo matching is a fundamental problem in computer vision with direct real-world applications in robotics, 3D reconstruction, augmented reality, and autonomous driving. The task is to estimate pixel-wise correspondences between an image pair and to generate a displacement map, termed disparity, which can be converted to depth using the parameters of the stereo camera system. Current state-of-the-art stereo algorithms use a 2D CNN to extract features and then form a cost volume, which is fed into a cost aggregation and regularization module composed of 2D or 3D CNNs. However, a large amount of high-frequency information, such as texture, color variation, and sharp edges, is not well exploited during this process, which leads to relatively blurry disparity maps that lack fine detail. In this paper, we aim to make full use of the high-frequency information in the original image. Existing methods need to be improved to handle the following two scenarios: (1) Most current approaches fall short on the finer features of the estimated disparity map, especially the edges of objects. In bokeh and rendering applications, the edge quality of the disparity map is critical to the final result. For example, technologies that require pixel-level rendering, such as VR and AR, demand a tight fit between the scene model and the image mapping, which means the edges in the disparity map must align closely with those in the original RGB image. (2) Mismatches in textureless regions and the loss of thin objects also significantly degrade the disparity map. For example, mismatched weakly textured walls and missing thin electrical wires are fatal flaws for obstacle-avoidance applications. Therefore, we propose EAI-Stereo and DLNR to improve the performance of current stereo matching networks in weakly textured and reflective regions, and to address problems such as blurry edges and blurry
disparity maps.

The major contributions of EAI-Stereo can be summarized as follows: (a) We propose an error-aware refinement module that combines left-right warping with learning-based upsampling. By incorporating the original left image, which contains more high-frequency information, and explicitly calculating error maps, our refinement module enables the network to better cope with overexposure, underexposure, and weak textures, and allows the network to learn error-correction capabilities, letting EAI-Stereo produce fine details and sharp edges. The learning-based upsampling method in the module provides more refined results than bilinear interpolation. We have carefully studied the impact of the module's microstructure on performance; our experiments show that this structure improves generalization ability while also improving accuracy. The approach is highly general and can be applied to any model that produces disparity or depth maps. (b) We propose an efficient iterative update module, called Multiscale Wide-LSTM, which efficiently combines multi-scale information from the feature extractor, the cost volume, and the current state, thus enhancing information transfer between iterations. (c) We propose a flexible overall structure that can balance inference speed against accuracy. The trade-off can be made without retraining the network, even at run time, and the number of iterations can be set dynamically based on a minimum frame rate. At the time of its development, EAI-Stereo ranked 1st on the Middlebury leaderboard and 1st on the ETH3D stereo benchmark for the 50% quantile metric among all published methods.

After proposing EAI-Stereo, we found that the algorithm still had room for improvement: (1) The data flow of the iterative unit can be further analyzed and improved. (2) The issue that the refinement module does not generalize well to scenes whose disparity ranges differ greatly from the training set could be
alleviated through appropriate modifications to the module. (3) The feature extractor becomes the bottleneck once the above two aspects have been properly addressed. Therefore, we propose DLNR (Stereo Matching Network with Decouple LSTM and Normalization Refinement) to alleviate these problems.

The main contributions of DLNR are the following: (a) To further address the loss of high-frequency information, we analyze the data-coupling issue that exists in most GRU-based iterative methods and propose the Decouple LSTM. (b) DLNR introduces normalization refinement for scenarios whose disparity range differs significantly from the training dataset. This greatly improves the refinement module's ability to cope with different disparity ranges and enhances the generalization performance of the model. (c) After the above two improvements, we found that the feature extractor became the performance bottleneck. In the field of stereo matching, feature extraction has not improved significantly for years; most learning-based methods still use ResNet-like feature extractors, which fall short in providing information to well-designed post-stage structures. To alleviate this problem, we propose a Channel-Attention Transformer feature extractor that aims to capture long-range pixel dependencies and preserve high-frequency information. Experiments show that with the proposed feature extractor, performance in weakly textured and reflective regions is greatly improved. Our method (DLNR) surpasses EAI-Stereo and ranks 1st on the Middlebury leaderboard, significantly outperforming the next-best method (EAI-Stereo) by 13.04%. Our method also ranks 1st on the KITTI-2015 benchmark for D1-fg among all published methods, which is important for autonomous driving, robotics navigation, and many other applications.
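For concreteness, the disparity-to-depth conversion mentioned above follows the standard pinhole stereo relation depth = f·B/d. The sketch below is illustrative only; the function name and the KITTI-like camera parameters in the usage line are our own assumptions, not taken from either method:

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to metric depth (meters).

    depth = f * B / d, where f is the focal length in pixels and B is
    the stereo baseline in meters. Pixels with (near-)zero disparity
    have no finite depth and are mapped to infinity here.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > eps
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Toy usage with KITTI-like intrinsics (f ~ 721 px, baseline ~ 0.54 m):
depth = disparity_to_depth(np.array([[64.0, 0.0]]), 721.0, 0.54)
```

A 64-pixel disparity under these parameters corresponds to roughly 6 m of depth, which is why edge errors of even a few pixels translate into large metric errors on distant objects.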
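The left-right warping at the heart of EAI-Stereo's error-aware refinement can be illustrated with a minimal sketch: the right image is warped into the left view using the estimated disparity, and the photometric residual against the real left image serves as an explicit error map. This is not the paper's implementation — it operates on single-channel NumPy arrays and uses nearest-neighbour rounding where a real network would use differentiable bilinear sampling, and all function names are hypothetical:

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Warp the right image into the left view using the left disparity.

    For each left pixel (y, x) the matching right pixel is (y, x - d),
    sampled here with nearest-neighbour rounding for brevity.
    Out-of-bounds samples are zero-filled.
    """
    h, w = right.shape
    xs = np.arange(w)[None, :] - np.round(disparity).astype(int)
    valid = (xs >= 0) & (xs < w)
    rows = np.repeat(np.arange(h)[:, None], w, axis=1)
    warped = np.zeros_like(right)
    warped[valid] = right[rows[valid], xs[valid]]
    return warped

def error_map(left, right, disparity):
    """Photometric error between the left image and the warped right image.

    Where the disparity is correct this residual is near zero; large
    values flag mismatches the refinement module can then correct.
    """
    return np.abs(left - warp_right_to_left(right, disparity))
```

In the actual network this residual would be fed, together with the left image, into learned refinement layers, which is how the module gains its explicit error-correction signal.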
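Similarly, the idea behind DLNR's normalization refinement — rescaling the disparity into a fixed numeric range before refinement so the module sees inputs that do not depend on the scene's disparity range — can be sketched as follows. This is a hedged toy version under our own assumptions, not DLNR's actual code; `refine_fn` stands in for the learned refinement network:

```python
import numpy as np

def normalized_refine(disparity, refine_fn, eps=1e-6):
    """Range-normalized refinement sketch.

    The initial disparity is scaled into [0, 1] before refinement, so
    the refinement operator sees the same numeric range whether the
    scene's disparities span tens or hundreds of pixels; the refined
    map is then scaled back to the original range.
    """
    scale = max(float(disparity.max()), eps)
    normalized = disparity / scale      # range-independent input
    refined = refine_fn(normalized)     # learned refinement would go here
    return refined * scale              # restore the original range

# Toy usage: an identity "refinement" must leave the map unchanged.
disp = np.array([[12.0, 96.0], [48.0, 192.0]])
out = normalized_refine(disp, lambda d: d)
```

The point of the design is that a refinement network trained on one disparity range no longer sees out-of-distribution magnitudes at test time, which is what improves generalization to scenes with very different disparity ranges.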