
Research On Multimodal Depth Estimation Method Based On Lightweight Network

Posted on: 2023-05-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Zhao
Full Text: PDF
GTID: 1528307058496874
Subject: Instrument Science and Technology
Abstract/Summary:
Depth sensing is fundamental to autonomous navigation, localization, and map building. However, mainstream depth sensors generally suffer from limited accuracy or low effective resolution, and deep-learning-based depth sensing faces a trade-off between model accuracy and model complexity. To obtain higher-quality depth maps from existing depth sensors, this dissertation investigates multimodal (RGB images plus sparse depth measurements) depth estimation methods based on lightweight networks from four perspectives: data fusion, convolutional computation, module construction, and architecture design. It aims to resolve the bottlenecks (poor real-time performance, inconvenient storage, difficult training, and limited flexibility) that multimodal depth estimation networks face from development to deployment, and to provide a reference for the development of depth sensors for unmanned sensing systems. The main research contents and contributions are as follows.

1. An attention-based multimodal data fusion method is studied. To address the insufficient use of prior information when standard convolutions process multimodal data, a multimodal depth estimation method based on attention convolution is proposed. From the perspective of data fusion, we explore the balance between model accuracy and complexity and improve multimodal depth estimation in both accuracy and real-time performance. We investigate how the attention mechanism acts during multimodal data fusion and conclude that: (i) the attention mechanism can exploit more multimodal prior information by responding to the state of the data distribution; (ii) the attention mechanism can suppress gradient anomalies caused by discrete values during training and thus promote network convergence. Validation experiments on the NYU-Depth-v2 and KITTI-Odometry datasets show that the attention-based multimodal depth estimation network achieves real-time depth estimation on an NVIDIA GTX 1060 graphics card while maintaining high accuracy.

2. A spatial-modal feature learning algorithm based on lightweight convolution design is proposed. To address the limitation that two-dimensional convolution extracts features only in the spatial dimensions, a spatial-modal feature learning method is proposed for the first time, and the importance of modal features for multimodal depth estimation is demonstrated with a three-dimensional convolution model. We study a spatial-modal feature extraction method based on lightweight 3D convolution, which performs 3D convolution operations with 2D convolution kernels, improving model accuracy without increasing the number of 2D convolution parameters. This partly solves the storage problem of multimodal depth estimation models caused by introducing additional parameters to improve accuracy. Validation experiments on the NYU-Depth-v2 and KITTI depth completion datasets show that the proposed lightweight 3D convolutional network obtains accuracy similar to a standard 3D convolutional network while using the same number of parameters as a 2D convolutional network.

3. A low-cost, high-precision model acquisition method based on module design is proposed. We propose a multi-stage, multi-scale feature extraction module that uses computing resources effectively, addressing the waste of computing power caused by feature redundancy in traditional convolutional modules (ConvBlock and ResBlock). From the perspective of module design, we investigate the balance between the accuracy and complexity of deep learning models and provide a way to acquire high-precision models at low computational cost. Because convolutional neural networks prioritize low-frequency signals during image training, an edge constraint module that constrains high-frequency regions is designed to further improve task accuracy without increasing the computational effort of the inference stage. Validation experiments on the NYU-Depth-v2 dataset show that the multimodal depth estimation network combining the multi-stage multi-scale feature extraction module and the edge constraint module has a parameter size of only about 1 MB, and achieves accuracy similar to current state-of-the-art (SOTA) methods using only a low-cost graphics card (NVIDIA GTX 1060) with 6 GB of memory.

4. A lightweight multimodal depth estimation method for cross-scene tasks is designed. To address the fixed structure and single operating point of traditional depth estimation models, we introduce flexibility into the evaluation of depth estimation networks. From the perspective of architecture design, we study a flexible, efficient, and lightweight network design method that improves the adaptability of multimodal depth estimation to different scenarios. To overcome the shortcomings of the traditional encoder-decoder structure in feature transfer, we investigate skip connections and gated convolution based on a nested network architecture, which significantly improve model accuracy without significantly increasing the number of model parameters. Validation experiments on the NYU-Depth-v2 and KITTI-Odometry datasets show that the nested-architecture multimodal depth estimation network can be disassembled at test time into a series of sub-networks with different parameter sizes and accuracy levels, offering an accuracy mode and a speed mode for cross-scene tasks. The sub-network in accuracy mode achieves highly accurate real-time dense depth estimation on the KITTI-Odometry dataset; the sub-network in speed mode has only 650 KB of parameters and reaches a test rate of up to 200 FPS (frames per second) on an NVIDIA TITAN V graphics card.
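To illustrate the attention-based fusion idea of contribution 1, the following is a minimal NumPy sketch: each modality's feature maps are re-weighted by a channel-attention gate that responds to the data distribution before the two streams are concatenated. The abstract does not specify the exact attention form, so the squeeze-and-excite style gate (global average pooling followed by a sigmoid) and the function names here are illustrative assumptions, not the dissertation's actual implementation.

```python
import numpy as np

def channel_attention(feats):
    # Squeeze: one statistic per channel, summarizing the data distribution.
    # feats has shape (C, H, W); w has shape (C,).
    w = feats.mean(axis=(1, 2))
    # Excite: sigmoid gate in (0, 1) -- an assumed, illustrative gating form.
    w = 1.0 / (1.0 + np.exp(-w))
    # Re-weight each channel map by its gate value.
    return feats * w[:, None, None]

def fuse_modalities(rgb_feat, depth_feat):
    # Gate each modality independently, then concatenate along the channel
    # axis so later layers see attention-weighted multimodal prior information.
    return np.concatenate([channel_attention(rgb_feat),
                           channel_attention(depth_feat)], axis=0)
```

For features of shape (C, H, W) per modality, the fused tensor has shape (2C, H, W); downstream convolutions then mix the two gated streams.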
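The parameter-saving trick of contribution 2 (3D convolution performed with 2D kernels) can be sketched as follows: a single k x k 2D kernel is shared across every slice of the modal axis, so the convolution aggregates information over the full 3D volume while storing only k*k parameters instead of the D*k*k a true 3D kernel would need. This is a minimal illustrative reading of the abstract's description, not the dissertation's exact operator; a naive valid-mode convolution loop is used for clarity.

```python
import numpy as np

def conv2d_valid(x, k):
    # Naive valid-mode 2D cross-correlation of image x with kernel k.
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def shared_kernel_conv3d(vol, k2d):
    # "3D" convolution over a (D, H, W) volume using one shared 2D kernel:
    # convolve every modal slice with the same k2d and sum over the modal
    # axis, so the parameter count stays at kh*kw instead of D*kh*kw.
    return np.sum([conv2d_valid(s, k2d) for s in vol], axis=0)
```

Compared with a standard 3D kernel, the shared 2D kernel trades per-slice specialization for a parameter count identical to plain 2D convolution, which is the storage property the abstract highlights.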
Keywords/Search Tags:Multimodal depth estimation, convolutional neural network, lightweight network design, module design, network architecture, attention mechanism, data fusion, modal features