
Research On Key Technologies Of Vision-based Scene Understanding

Posted on: 2022-03-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 1488306725971299
Subject: Computer Science and Technology
Abstract/Summary:
The goal of vision-based scene understanding is to transform visual data into semantic information, so that computers can describe and summarize visual image scenes. The basic task of scene understanding is to identify the main objects together with their positions and categories, while resolving occlusions between objects. Image segmentation and object detection are two foundational techniques in scene understanding. However, the quality of samples in real scenes is greatly affected by complex lighting, hardware limitations, and other conditions such as occlusion and viewpoint change. In addition, the complexity of some scenarios makes it difficult to obtain sufficient samples. Both issues pose serious challenges to visual algorithms based on machine learning, so research on image segmentation and object detection for scene understanding is valuable in both theory and practice. This dissertation focuses on image segmentation and object detection, covering unsupervised segmentation of natural images, semantic segmentation of RGBD images, and an application of object detection.

· For unsupervised segmentation of natural images, existing segmentation methods based on affinity graphs usually fuse different affinity graphs according to empirical heuristics such as the area and similarity of superpixels. However, it is difficult to accurately define local and global nodes from such simple feature representations, because the features of superpixels vary greatly across scales. Moreover, the linear affinity graphs commonly used in these algorithms cannot fully exploit the nonlinear structural information among multiscale superpixels. To solve these problems, we propose an adaptive affinity fusion graph framework that effectively connects different graphs with high discriminating power and nonlinearity for natural image segmentation. The framework combines different graphs according to a new notion, the affinity nodes of multiscale superpixels. These affinity nodes are selected based on a better affiliation of superpixels, namely the subspace-preserving representation generated by sparse subspace clustering based on subspace pursuit. A kernel spectral clustering based graph (KSC-graph) is then built via a novel kernel spectral clustering to explore the nonlinear relationships among these affinity nodes. Moreover, an adjacency graph is constructed at each scale and is further used to update the proposed KSC-graph at the affinity nodes. The fusion graph is built across scales and partitioned to obtain the final segmentation, as illustrated in the sketch below. Experimental results on the Berkeley segmentation dataset, the Microsoft Research Cambridge dataset, and the Stanford Background dataset show the superiority of our framework in comparison with state-of-the-art methods.
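As a rough illustration of the final graph-fusion and partitioning step, the following Python sketch fuses per-scale affinity matrices and cuts the fused graph with spectral clustering. The RBF kernel here is a simple stand-in for the kernel spectral clustering used to build the KSC-graph, and the node selection and fusion weights are simplifying assumptions, not the dissertation's exact procedure.

    # Minimal sketch: fuse per-scale affinity graphs and partition them.
    # Assumes all matrices are aligned to one shared set of superpixel nodes.
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.cluster import SpectralClustering

    def kernel_graph(node_features, gamma=1.0):
        # Nonlinear affinities among selected affinity nodes (stand-in for
        # the KSC-graph construction).
        return rbf_kernel(node_features, gamma=gamma)

    def fuse_graphs(graphs, weights=None):
        # Weighted average of per-scale graphs; equal weights by default.
        weights = weights or [1.0 / len(graphs)] * len(graphs)
        fused = sum(w * g for w, g in zip(weights, graphs))
        return (fused + fused.T) / 2.0  # keep the fused graph symmetric

    def segment(graphs, n_segments):
        # Partition the fused affinity graph; each label groups superpixel
        # nodes into one region of the final segmentation.
        sc = SpectralClustering(n_clusters=n_segments, affinity="precomputed",
                                assign_labels="discretize", random_state=0)
        return sc.fit_predict(fuse_graphs(graphs))

    # Toy usage: 50 superpixel nodes with 8-D features, two scales, 4 regions.
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((50, 8))
    adjacency = np.abs(rng.standard_normal((50, 50)))  # stand-in adjacency graph
    labels = segment([kernel_graph(feats), adjacency], n_segments=4)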
· For semantic segmentation of RGBD images, atrous/dilated convolution-based methods fail to capture small objects with accurate boundaries. Encoder-decoder models process the paired complementary cues only in the encoder and ignore cross-modal information during decoding; moreover, training such a model is usually hard to converge because of this imbalance between encoder and decoder. Multi-task learning-based methods perform multi-task distillation at a fixed scale with a specific receptive field in the decoder, yet the influence between two tasks in fact differs for various sizes of receptive field. To solve these problems, we propose a dual supervised decoder framework based on an attention mechanism. This framework makes full use of cross-modal complementary information and extracts multi-scale features with multi-level receptive fields; it also exploits the relationships among different tasks through task transfer learning to improve the segmentation accuracy of RGBD images. In the encoder, we design a simple yet effective attention-based multi-modal fusion module (see the first sketch after this section) to extract and deeply fuse multi-level paired complementary information. To learn more robust deep representations and richer multi-modal information, we introduce a dual-branch decoder that effectively leverages the correlations and complementary cues of different tasks. In the main branch of the decoder, multi-scale context is combined by an atrous spatial pyramid pooling module under pyramid supervision; in addition, this branch is supervised by another task-specific decoder, such as surface normal estimation, to improve performance and training convergence speed. Finally, we explore task relations through transfer learning in multi-task learning and propose a more effective two-stage training method that further improves semantic segmentation accuracy. Extensive experiments on the NYUDv2 and SUN RGB-D datasets demonstrate that our method achieves superior performance against state-of-the-art methods.

· Due to illumination change, occlusion, and limited hardware resources, the accuracy and speed of general object detection fail to meet the requirements of real applications. Most early detectors relied on traditional machine learning methods whose large computational cost makes it difficult to satisfy requirements of real-time performance and versatility, while general object detectors based on convolutional neural networks (CNNs) have not been analyzed with respect to illumination change. Under limited hardware resources, CNN-based detectors cannot reach real-time detection speeds above 30 frames per second, and their model parameters are relatively large. For object detection in the wild, we propose an end-to-end lightweight framework that improves detection accuracy while supporting real-time operation with low resource requirements. We first design a novel lightweight backbone (RFDNet) to improve accuracy and reduce computational cost. We then propose a multi-region proposal network using multiscale feature maps generated from RFDNet to improve detection performance. Finally, we present multi-level position-sensitive score maps and region-of-interest pooling to further improve accuracy with few redundant computations (see the second sketch after this section). Extensive experimental results on the ImageNet, Pascal VOC, and MS COCO datasets suggest that RFDNet significantly improves the performance of the baseline network with higher accuracy and efficiency. Experiments on six fault datasets of freight trains show that our method is capable of real-time detection at over 38 frames per second and achieves competitive accuracy with lower computation than state-of-the-art detectors.
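For the RGBD framework (the first sketch referenced above), the following PyTorch code shows one plausible form of an attention-based multi-modal fusion module: channel attention derived from both modalities gates each stream before a weighted sum. The layer sizes, reduction ratio, and gating scheme are illustrative assumptions, not the dissertation's exact module.

    # Minimal sketch of attention-based RGB-D feature fusion in the encoder.
    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        """Gate each modality's feature map with channel attention, then sum."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                       # global context
                nn.Conv2d(2 * channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, 2 * channels, 1),
                nn.Sigmoid(),                                  # per-channel weights
            )

        def forward(self, rgb, depth):
            x = torch.cat([rgb, depth], dim=1)                 # pair the two cues
            w = self.gate(x)
            w_rgb, w_depth = w.chunk(2, dim=1)                 # split the gates
            return rgb * w_rgb + depth * w_depth               # weighted fusion

    # Toy usage: fuse 256-channel feature maps from the two encoder streams.
    fuse = AttentionFusion(256)
    out = fuse(torch.randn(2, 256, 30, 40), torch.randn(2, 256, 30, 40))

Deriving the gates from the concatenated pair lets each modality's weight depend on both cues; other common variants gate only the depth stream or add spatial attention on top.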
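For the detection framework (the second sketch referenced above), the code below illustrates two of the named building blocks under simplifying assumptions: a depthwise separable convolution block as a generic stand-in for RFDNet's lightweight design, and torchvision's position-sensitive RoI pooling applied to k x k score maps. The channel counts, stride, and example box are illustrative only.

    # Minimal sketch: a lightweight conv block plus position-sensitive
    # RoI pooling over per-class score maps.
    import torch
    import torch.nn as nn
    from torchvision.ops import ps_roi_pool

    def light_block(cin, cout, stride=1):
        """Depthwise conv + pointwise conv: far fewer FLOPs than a dense 3x3."""
        return nn.Sequential(
            nn.Conv2d(cin, cin, 3, stride, 1, groups=cin, bias=False),  # depthwise
            nn.BatchNorm2d(cin), nn.ReLU(inplace=True),
            nn.Conv2d(cin, cout, 1, bias=False),                        # pointwise
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        )

    # Score maps: k*k position-sensitive channels per class (k=7, 21 classes).
    k, num_classes = 7, 21
    backbone = light_block(3, k * k * num_classes, stride=2)
    feat = backbone(torch.randn(1, 3, 224, 224))       # 1 x (k*k*21) x 112 x 112

    # One RoI as (batch_idx, x1, y1, x2, y2) in image coordinates;
    # spatial_scale maps it onto the stride-2 feature map.
    rois = torch.tensor([[0, 0.0, 0.0, 128.0, 128.0]])
    scores = ps_roi_pool(feat, rois, output_size=k, spatial_scale=0.5)
    cls_scores = scores.mean(dim=(2, 3))               # 1 x num_classes

Replacing dense 3x3 convolutions with depthwise separable ones cuts their cost roughly 8-9x, which is how lightweight backbones of this kind keep parameter counts and latency low.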
Keywords/Search Tags:Scene understanding, object detection, train image, fault detection, image segmentation, unsupervised, RGBD, semantic segmentation