
Design And Implementation Of Stereo Vision System Based On Attention Mechanism And Matching Measure Learning

Posted on: 2022-10-21
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhang
Full Text: PDF
GTID: 2518306338985969
Subject: Computer technology
Abstract/Summary:
Binocular stereo matching is a fundamental and challenging task in computer vision, with wide application in autonomous driving, dense reconstruction, and other depth-related scenarios. Precise semantic context information can provide regional support for stereo matching: pixels that share the same semantics tend to lie at similar depths. By aggregating features whose depth is similar to that of the current pixel, the distinguishability of its feature representation can be enhanced. Highly distinguishable features effectively reduce the chance of mismatching, which is essential for accurate matching in ill-posed regions such as occlusions and weakly textured areas. Designing a suitable matching measure to construct the cost volume, so as to make full use of the rich feature information learned by the network, is also a key research direction.

Traditional methods mostly use hand-crafted algorithms to obtain semantic context information, but these methods are usually applicable only to specific scenarios and therefore generalize poorly. The emergence of convolutional neural networks has greatly improved the ability to learn feature representations, and many methods that capture semantic context information through multi-scale feature aggregation have been proposed. These methods outperform traditional ones, but convolution and pooling operations of fixed size and shape still limit the network's ability to model geometric transformations, so the resulting context information is not sufficiently accurate or comprehensive.

In this thesis, we first propose a deformable self-attention stereo matching network to capture accurate global contextual information. The self-attention module captures global context information by adaptively aggregating depth-similar features over the global scope, enhancing the feature representation in the spatial dimension. We then apply deformable convolution to the features produced by the self-attention mechanism to improve the network's ability to handle complex deformations, which helps refine the discriminative representation of pixels in boundary regions and reduces disparity blurring and coupling at boundaries. Although the self-attention model is highly accurate, it consumes substantial computing resources. To address this drawback, this thesis proposes several self-attention variants that improve time efficiency.

After rich feature information has been obtained, the cost volume is constructed by computing the matching cost between each pair of candidate feature points, and the cost volume is then processed further to obtain the final disparity estimate. Traditional matching costs based on cross-correlation, mutual information, and the Census transform have achieved good results, but they are difficult to transfer directly to neural networks with the desired effect. Most current neural-network-based binocular stereo matching algorithms use a learnable matching measure to construct the cost volume and then regularize it with 3D convolutions, which play the role of the cost aggregation step in traditional methods. This thesis builds on this idea and proposes several learning-based matching measures to address the shortcomings of current algorithms. In addition, combining different semantic context extraction modules yields different network architectures suited to different application scenarios.

Experimental results on the SceneFlow and KITTI test sets show that the best model of the proposed binocular stereo matching network based on the deformable attention mechanism achieves a significant improvement in accuracy on both datasets, and surpasses many strong network architectures from related research in recent years on accuracy metrics.
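
The cost-volume step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's learned matching measure: it assumes a fixed dot-product correlation between L2-normalized features as the matching score, and it replaces 3D-convolution regularization with a plain winner-take-all choice of disparity. All function names and array shapes here are hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(feat, eps=1e-12):
    """Scale each pixel's feature vector to unit length; feat has shape (C, H, W)."""
    norm = np.sqrt((feat ** 2).sum(axis=0, keepdims=True))
    return feat / (norm + eps)


def correlation_cost_volume(left_feat, right_feat, max_disp):
    """Build a (max_disp, H, W) similarity volume from (C, H, W) feature maps.

    Entry [d, y, x] is the dot product between the left feature at (y, x)
    and the right feature at (y, x - d). Positions where x - d falls outside
    the image keep the value -inf so they are never selected.
    """
    _, h, w = left_feat.shape
    volume = np.full((max_disp, h, w), -np.inf)
    for d in range(max_disp):
        # Left columns d..w-1 are compared with right columns 0..w-d-1.
        volume[d, :, d:] = (left_feat[:, :, d:] * right_feat[:, :, :w - d]).sum(axis=0)
    return volume


def winner_take_all(volume):
    """Pick, per pixel, the disparity with the highest similarity score."""
    return volume.argmax(axis=0)


# Synthetic check: the left features are the right features shifted by 3 pixels,
# so the recovered disparity should be 3 wherever column x - 3 exists.
rng = np.random.default_rng(0)
right = l2_normalize(rng.standard_normal((8, 4, 12)))
left = np.roll(right, shift=3, axis=2)
disp = winner_take_all(correlation_cost_volume(left, right, max_disp=6))
```

In a learned pipeline of the kind the abstract describes, the dot product would be replaced by a trainable matching measure and the `argmax` by a regularization network followed by disparity regression; the sketch only shows how the volume is indexed by disparity.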
Keywords/Search Tags: Stereo matching, Semantic context information, Attention mechanism, Matching measure learning