
Research On Technology Of Real-scene Light-field Content Generation By Perceiving Image Depth

Posted on: 2023-12-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H C Wang
Full Text: PDF
GTID: 1528306911495014
Subject: Electronic Science and Technology
Abstract/Summary:
In natural human activity, a subject's perception of the depth of the surrounding real scene is indispensable and plays a crucial guiding role. Human depth perception generally covers how far objects are from the observer and the distances between objects in the scene, and it underpins normal activity, e.g., basic activity conditions and behavior. With the development of intelligent computing, people have tried to recover the structural depth information of given real scenes and even to generate real-scene content with man-made devices such as computers. In the field of scene-content generation, light-field content generation has gradually attracted the attention of many researchers for its complete light-field theory and potential applications such as light-field display and 3D medical imaging. In light-field content generation, a researcher first uses sparse views of a given scene to perceive or restore the structural depth of the target light field, which resembles human depth perception in natural activity. Only by accurately obtaining the depth of the target scene can researchers reconstruct sparse or dense 3D scene models and accomplish the task of generating 3D light-field content. Inspired by this perception-generation research paradigm, this dissertation studies self-supervised monocular and binocular depth perception with Convolutional Neural Networks (CNNs), using only color images as the supervisory signal of the learning model, and proposes a novel and efficient depth-perception-based solution for light-field content generation. The main contents and innovations of this research are as follows.

(1) Self-supervised monocular depth estimation based on a progressive strategy. To reduce the over-reliance on ground-truth depth data in monocular depth estimation, a novel self-supervised method based on a progressive strategy is proposed. In the loss design, the expensive depth labels are replaced by the reconstruction error of a color image, so the model parameters can be learned in a self-supervised manner from color images alone. On top of the image reconstruction loss, a depth smoothing loss ensures smoothness on object surfaces in the estimated depth map, and an image structure-similarity loss helps preserve the scene structure during image reconstruction. The CNN consists of a feature encoder for monocular images and a depth decoder with four progressive modules. Besides learning useful features of a single color image, the encoder also estimates an initial coarse depth map of the given scene. Each progressive module takes the coarse depth map and a color image of the same resolution as input, and leverages the high-frequency information in the color image to gradually refine the coarse depth map, for example restoring edge details and the smoothness of object surfaces, before outputting a higher-resolution depth map. Experiments show that this progressive model efficiently reduces blur at object edges in the depth map and produces smooth depth estimates on the surfaces of objects near the camera. In generalization tests on several real-world datasets, the model also correctly recovers the depth of target objects from a single color image in new scenes, demonstrating good generalization. On the real-scene datasets KITTI and Make3D, the RMSE (Root Mean Square Error) of the estimated depth maps reaches 4.524 and 6.460, respectively.

(2) Self-supervised binocular depth estimation based on bi-directional pixel-movement learning. For binocular depth estimation, a novel self-supervised method is proposed to further improve the accuracy of self-supervised depth estimation. It takes binocular images as input and perceives the initial depth-plane features of a given scene by learning the bi-directional movement of pixels in a middle-view synthesis task. The proposed CNN comprises two consecutive visual tasks: novel-view synthesis and depth estimation. In the view synthesis task, the middle view is innovatively used as the supervisory signal; the goal of middle-view synthesis is to learn the bi-directional, opposite movements of pixels in the left and right views (left to middle and right to middle), align the pixels at the middle-view position, and thereby obtain the middle view. After training the middle-view synthesis task, this bi-directional pixel-movement information is stored in the CNN parameters and expressed in the depth-plane features. In the depth estimation task, only a few convolutional layers are needed to extract and fuse the depth-plane features learned in middle-view synthesis and produce the depth map of the target scene. This task is supervised by the image reconstruction loss, again avoiding expensive ground-truth depth data. Experiments show that the bi-directional pixel-movement model estimates high-quality depth maps on real-scene datasets, including the depth of small objects and edges in a scene. On the real-scene dataset Middlebury, the RMSE of depth estimation is 2.264, and the proportion of pixels with more than a single-pixel error is below 10%. In stability tests, the model also estimates depth reliably for binocular images over different disparity ranges. Finally, the estimated depth maps were applied to a 3D display device and achieved a good 3D effect.

(3) Dense view generation based on depth perception and a position-guided factor. For real-scene light-field content generation, an efficient method based on depth perception and a position-guided factor is proposed to generate virtual views at expected positions. The key of the method is to build a one-to-one mapping between a position factor and a virtual view, so that virtual views at different positions can be generated by changing the position factor. In terms of functional modules, the proposed CNN contains four parts: a depth estimation network, a depth mapping module, a consistency checking module, and a view rectifying network. The depth estimation network takes binocular images as input, uses a middle view to supervise the depth estimation task, and predicts the depth map of the middle view. The depth mapping module obtains a proxy depth map by adjusting the depth values of the middle view through a position factor, and then applies a reverse mapping rule to generate seed-view images at the corresponding position factor. The consistency checking module exploits the geometric relationship between the seed-view images and filters out wrong pixels by generating a corresponding mask, which eases the learning of the subsequent view rectifying network. In the final view rectifying network, the filtered seed-view pair has its holes filled and its pixel shifts corrected, yielding a high-quality view corresponding to the position factor. Experiments show that the proposed model achieves strong view generation performance on real-scene datasets and can generate arbitrary views with the correct parallax relationship between left and right views. On the Middlebury and Stanford light-field datasets, the SSIM, an evaluation metric for view generation, remains stable above 0.93. Finally, the dissertation verified the display effect of the generated dense views on a 3D light-field display device, where correct and smooth object-occlusion relationships were observed.
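The self-supervised loss combination described in (1) — photometric reconstruction in place of depth labels, plus structure similarity and depth smoothness — can be sketched as follows. This is a minimal NumPy illustration, not the dissertation's implementation: the blending weight alpha, the single-window SSIM, and the edge-aware weighting are common choices in self-supervised depth pipelines and are assumptions here.

```python
import numpy as np

def ssim_index(x, y, c1=0.01**2, c2=0.03**2):
    # Global (single-window) SSIM: a simplification of the usual windowed version.
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def photometric_loss(recon, target, alpha=0.85):
    # Reconstruction error of a color image replaces ground-truth depth:
    # blend an SSIM-based structural term with plain L1 (alpha is an assumption).
    l1 = np.abs(recon - target).mean()
    return alpha * (1.0 - ssim_index(recon, target)) / 2.0 + (1 - alpha) * l1

def edge_aware_smoothness(depth, image):
    # Penalize depth gradients, down-weighted where the image itself has edges,
    # so object surfaces stay smooth without blurring across boundaries.
    dx_d = np.abs(np.diff(depth, axis=1))
    dy_d = np.abs(np.diff(depth, axis=0))
    dx_i = np.abs(np.diff(image, axis=1))
    dy_i = np.abs(np.diff(image, axis=0))
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```

A perfect reconstruction drives the photometric term to zero, and a constant depth map has zero smoothness penalty, which is the behavior the combined objective rewards.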
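The bi-directional pixel movement behind the middle-view synthesis in (2) can be illustrated geometrically. The toy sketch below assumes rectified stereo where a point with disparity d sits at x in the left view and x - d in the right view, so the middle view samples the left image at x + d/2 and the right image at x - d/2 — opposite half-disparity movements. In the dissertation these movements are learned inside a CNN rather than computed from a given disparity map.

```python
import numpy as np

def warp_horizontal(img, shift):
    """Backward-warp: out[y, x] = img[y, x + shift[y, x]] (nearest neighbor)."""
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            src = int(round(x + shift[y, x]))
            if 0 <= src < w:
                out[y, x] = img[y, src]
    return out

def synthesize_middle(left, right, disp):
    # Left and right pixels move in opposite directions by half the disparity
    # and align at the middle-view position; average where both warps land.
    mid_from_left = warp_horizontal(left, disp / 2.0)    # sample left at x + d/2
    mid_from_right = warp_horizontal(right, -disp / 2.0)  # sample right at x - d/2
    return 0.5 * (mid_from_left + mid_from_right)
```

Supervising the synthesized result with the real middle view forces the network to encode exactly these pixel movements, which is why the learned features later carry depth-plane information.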
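The position-factor mapping in (3) can be sketched as a warp whose magnitude scales with a scalar t. Everything below is a hypothetical toy: the parallax model t * k * inverse depth, the scale k, and the nearest-neighbor backward warp are illustrative assumptions, and the boolean mask only mimics the role of the dissertation's consistency checking module (flagging pixels with no valid source before a rectifying network would repair them).

```python
import numpy as np

def render_seed_view(middle, inv_depth, t, k=8.0):
    """Warp the middle view to a position factor t (t = 0 is the middle view).

    Parallax per pixel is modeled as t * k * inv_depth (an assumption); mask
    marks pixels that received a valid source, akin to a consistency check.
    """
    h, w = middle.shape
    out = np.zeros_like(middle)
    mask = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            src = int(round(x + t * k * inv_depth[y, x]))
            if 0 <= src < w:
                out[y, x] = middle[y, src]
                mask[y, x] = True
    return out, mask
```

Sweeping t over a range of values yields one virtual view per position factor — the one-to-one position-to-view mapping the method relies on for dense view generation — while masked-out holes are what the view rectifying network would fill.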
Keywords/Search Tags:Monocular depth estimation, Binocular depth estimation, Self-supervised learning, Progressive strategy, Bi-directional pixel movement, Position factor, Dense view generation, Light-field display