Research on a Self-Supervised Monocular Scene Flow Estimation Method Based on Transformer

Posted on: 2024-09-25    Degree: Master    Type: Thesis
Country: China    Candidate: J Yang    Full Text: PDF
GTID: 2568306941497214    Subject: Electronic information
Abstract/Summary:
Artificial intelligence (AI) systems rely on perceiving moving objects in 3D scenes. Optical flow estimation predicts the direction and speed of object motion by computing the correlation between adjacent input frames, and depth estimation predicts the distance from a target object to the camera. Scene flow estimation, which combines optical flow and depth, is therefore widely used in AI processing and machine vision. It is divided into binocular and monocular scene flow estimation. Binocular scene flow estimation requires laser sensors or a binocular stereo camera to acquire its input images, which makes image acquisition costly and complex and makes the approach difficult to apply in real scenes. With the rapid development of scene flow estimation, it has become possible to recover scene information from a monocular image sequence, and many practical monocular scene flow estimation networks have been proposed and widely used.
In its encoder, a monocular scene flow estimation network learns optical flow and depth within a single shared feature extraction framework. Accurate estimation therefore depends on extracting rich, detailed features and separating useful information from a large amount of scene information. However, most existing monocular scene flow networks either propose new loss functions to strengthen the consistency between optical flow and depth, or design new decoders and network architectures to iteratively update optical flow and depth more effectively, while neglecting the network's feature extraction. To address this, this thesis introduces the Convolutional Transformer into the feature pyramid layers of the monocular scene flow estimation network. Spatial downsampling is performed by convolutional embedding and convolutional projection with different strides, which reduces the length of the feature sequence while increasing its feature dimension, so that finer pixel-level features are captured and monocular scene flow is estimated efficiently.
Focusing only on pixel feature extraction while ignoring other information useful to the model makes the network prone to overfitting, and large-displacement moving objects and occluded pixels cannot be estimated effectively. To address these problems, this thesis introduces a hierarchical Transformer with relative position encoding into the monocular scene flow estimation network to strengthen the correlation between pixels in adjacent frames for accurate query matching. The cost volume, which stores the correlation between features, is fed into the Transformer in a hierarchical manner; joint local and global attention then aggregates information from the cost volume effectively, matches the relative relationships between adjacent pixels accurately, and achieves focused matching of adjacent pixels. To attend to the latent information between distant pixels that arises from large-displacement moving objects, relative position encoding learns the relative positional relationship between different sequences and assigns different levels of attention according to the distribution of relative distances, which helps capture long-range dependencies.
Finally, in the feature extraction part of the network, Depthwise Over-parameterized Convolution replaces conventional convolution to enhance features and to address the neglect of edge feature information. To verify the effectiveness of the proposed method, the optical flow and depth estimation performance of the network is tested on the KITTI datasets. The experimental results show that the proposed method improves the network's estimation performance and is clearly competitive.
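The effect of strided convolutional embedding on sequence length and feature dimension can be illustrated with simple shape arithmetic. The sketch below is not the thesis's implementation: the input resolution and the three-stage (kernel, stride, padding, channels) configuration are hypothetical values chosen only for illustration, and the function names are invented here. It uses the standard output-size formula for a strided convolution.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output length of a strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def token_embedding_shapes(height, width, stages):
    """Track (token count, channels) after each convolutional embedding stage.

    `stages` is a list of (kernel, stride, padding, out_channels) tuples.
    """
    shapes = []
    for kernel, stride, padding, channels in stages:
        height = conv_output_size(height, kernel, stride, padding)
        width = conv_output_size(width, kernel, stride, padding)
        # Each spatial position of the feature map becomes one token.
        shapes.append((height * width, channels))
    return shapes


# Hypothetical configuration, not taken from the thesis.
STAGES = [(7, 4, 2, 64), (3, 2, 1, 192), (3, 2, 1, 384)]

if __name__ == "__main__":
    for tokens, channels in token_embedding_shapes(256, 832, STAGES):
        print(f"tokens={tokens:6d}  channels={channels}")
```

Under these assumed settings, each stage shortens the token sequence by roughly the square of its stride while the channel width grows, which is the trade-off the abstract describes.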
Keywords/Search Tags:Monocular scene flow estimation, Optical flow estimation, Depth estimation, Transformer, Depthwise over-parameterized convolution