
Research On Deep Learning Networks For Scene Parsing

Posted on: 2019-07-15  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Z Y Wang  Full Text: PDF
GTID: 1368330548499831  Subject: Computer Science and Technology
Abstract/Summary:
Scene parsing is a complex, fundamental computer vision task: it must not only detect and segment the different objects appearing in a scene, but also recognize which categories those objects belong to. The core goal of scene parsing is therefore to predict the correct class label for every pixel of an image, which benefits a wide range of intelligent vision tasks such as object detection, robot task planning, self-driving cars, and autonomous navigation of unmanned aerial vehicles. In addition, deep learning, a branch of machine learning that has developed rapidly in recent years, offers feature extraction methods that effectively emulate the human visual system, building up feature information about objects step by step; it has consequently become the mainstream approach in computer vision research. The design of deep learning networks for scene parsing has thus become one of the hot topics in current research. Focusing on the main problems scene parsing faces, we study the shortcomings of existing deep learning networks for scene parsing and propose corresponding solutions. The main contents and contributions of our study are as follows:

(1) Effective visual feature extraction and accurate spatial structural learning are two keys to improving the accuracy of RGB scene parsing. Although convolutional neural networks (CNNs) have shown a strong ability to extract features, their ability to learn spatial structure is weak. We therefore propose spatial structural encoded deep networks (SSEDNs) for RGB scene parsing. The embedded structural learning layer organically combines conditional random fields (CRFs) with a spatial structural encoding algorithm, so that it can learn the spatial distributions of objects and the spatial relationships among objects more comprehensively and accurately. Furthermore, the feature fusion layer skillfully exploits the advantages of both deep belief networks (DBNs) and the improved CRFs, achieving deep structural learning from the comprehensive semantic information of objects and the semantically correlated information among objects produced by multi-modal feature fusion.

(2) The two major problems facing existing RGB-D scene parsing methods are how to accurately learn the three-dimensional spatial structure of objects and how to effectively fuse feature information from RGB and depth images. To solve these problems, we propose three-dimensional spatial structural encoded deep networks (3D-SSEDNs) for RGB-D scene parsing. The embedded structural learning layer organically combines CRFs with a three-dimensional spatial structural encoding algorithm, learning the three-dimensional spatial distributions of objects and the three-dimensional spatial relationships among them more comprehensively and accurately. Furthermore, the feature fusion layer skillfully uses DBNs to fuse the feature information from RGB and depth images, fully exploiting the correlation between the visual information of RGB images and the depth information of depth images.

(3) Training the (three-dimensional) spatial structural encoded deep networks stage by stage may lose feature information. We therefore reconstruct the structural learning layer with long short-term memory networks (LSTMs) and the feature fusion layer with CNNs, and propose global context information inference deep networks (GCIIDNs), which support end-to-end, pixels-to-pixels joint optimization and give full play to each layer's advantages compared with the separately trained (3D-)SSEDNs. In addition, because the (three-dimensional) spatial structural encoding algorithm can only infer local (three-dimensional) spatial context information, we combine four one-dimensional LSTMs to explicitly infer global context information in the structural learning layer, learning both long-distance and short-distance (three-dimensional) spatial dependencies among objects more comprehensively and accurately. The long-distance dependencies, which represent the relative (three-dimensional) spatial relationships among objects, enable correct and reasonable prediction of the global (three-dimensional) spatial distribution of the scene, while the short-distance dependencies, which represent the boundary characteristics between adjacent objects, help to optimize the consistency and smoothness of object contours.

(4) Research shows that adversarial training can not only improve the performance of a generative network through competition with a discriminative network, but also effectively reduce overfitting of the generative network during training. To this end, we take the GCIIDNs as the generative network and propose spatial structural inference embedded adversarial networks (SSIEANs) based on an optimized adversarial training method, thereby organically combining the advantages of hierarchical feature extraction, spatial structural inference, multi-modal feature fusion, and adversarial training. Through adversarial training, the SSIEANs can not only detect mismatches between the scene parsing results produced by the generative network and the corresponding ground truth via the analysis and judgment of the discriminative network, but also fine-tune the parameters of every layer of the generative network adversarially through competition with the discriminative network. This makes full use of the feature extraction layer, the structural learning layer, and the feature fusion layer, and significantly improves the semantic consistency between the scene parsing results and the ground truth.
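The role of the CRF in contribution (1) can be illustrated with a minimal sketch: a grid CRF energy over a label map, with unary costs from per-pixel class scores and a Potts-style pairwise penalty that favors spatially consistent labels. The function names, the Potts penalty, and the toy numbers below are illustrative assumptions, not the dissertation's actual formulation.

```python
import numpy as np

def crf_energy(unary, labels, pairwise_weight=1.0):
    """Energy of a label map under a simple 4-connected grid CRF:
    sum of unary costs plus a Potts penalty for each neighbouring
    pixel pair that takes different labels."""
    h, w, _ = unary.shape
    rows, cols = np.indices((h, w))
    # Unary term: cost of the chosen label at each pixel.
    energy = unary[rows, cols, labels].sum()
    # Pairwise Potts term over horizontal and vertical neighbours.
    energy += pairwise_weight * (labels[:, 1:] != labels[:, :-1]).sum()
    energy += pairwise_weight * (labels[1:, :] != labels[:-1, :]).sum()
    return float(energy)

# Toy example: 2x2 image, 2 classes.
unary = np.array([[[0.1, 2.0], [0.2, 1.5]],
                  [[1.8, 0.3], [2.1, 0.4]]])
smooth = np.array([[0, 0], [1, 1]])   # labels agreeing with the unaries
noisy = np.array([[0, 1], [1, 0]])    # spatially inconsistent labels
print(crf_energy(unary, smooth))      # lower energy
print(crf_energy(unary, noisy))       # higher energy
```

Minimizing such an energy is what pushes a CRF-equipped parser toward label maps that both match the per-pixel evidence and respect spatial structure.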
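The four-directional sweep in contribution (3) can be sketched in miniature: sweep a 2-D feature map in four scan directions and sum the directional states, so each pixel aggregates context from its entire row and column (stacking two such layers extends coverage to the full image). Cumulative sums stand in here for the learned, gated recurrences of the four one-dimensional LSTMs; the function name is an assumption for illustration.

```python
import numpy as np

def four_direction_context(features):
    """Aggregate context for every pixel by sweeping a 2-D feature
    map in four directions (left->right, right->left, top->bottom,
    bottom->top) and summing the four directional states. Cumulative
    sums are a stand-in for four 1-D LSTM recurrences."""
    lr = np.cumsum(features, axis=1)                    # left -> right
    rl = np.cumsum(features[:, ::-1], axis=1)[:, ::-1]  # right -> left
    tb = np.cumsum(features, axis=0)                    # top -> bottom
    bt = np.cumsum(features[::-1, :], axis=0)[::-1, :]  # bottom -> top
    return lr + rl + tb + bt

f = np.arange(12, dtype=float).reshape(3, 4)
ctx = four_direction_context(f)
# Each output value equals (row sum) + (column sum) + 2 * (own value):
# the four sweeps together cover the pixel's full row and column.
```

With learned recurrences in place of cumulative sums, the same sweep pattern lets the network infer long-distance dependencies (relative spatial layout) and short-distance dependencies (object boundaries) in a single differentiable layer.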
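The adversarial objective in contribution (4) can be sketched as a two-part loss: the discriminator learns to score ground-truth label maps as real and generated parsing results as fake, while the generator adds an adversarial term to its segmentation loss that rewards fooling the discriminator. The binary cross-entropy form, the weighting, and the toy scores below are assumptions for illustration; the dissertation's "optimized adversarial training method" is not reproduced here.

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy for discriminator outputs in (0, 1)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def discriminator_loss(d_real, d_fake):
    # Discriminator: ground-truth maps should score as real (1),
    # generated parsing maps as fake (0).
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(seg_loss, d_fake, adv_weight=0.1):
    # Generator: per-pixel segmentation loss plus an adversarial
    # term pushing the discriminator to score its output as real.
    return seg_loss + adv_weight * bce(d_fake, 1.0)

d_real = np.array([0.9, 0.8])   # discriminator scores on ground truth
d_fake = np.array([0.2, 0.3])   # discriminator scores on generated maps
print(discriminator_loss(d_real, d_fake))
print(generator_loss(seg_loss=0.5, d_fake=d_fake))
```

Alternating gradient steps on these two losses is what lets the discriminator's judgment fine-tune every layer of the generative network, as the abstract describes.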
Keywords/Search Tags:scene parsing, deep learning, convolutional neural networks, conditional random fields, long short-term memory networks, adversarial training method