
Research On Semantic Scene Parsing

Posted on: 2020-07-25  Degree: Doctor  Type: Dissertation
Country: China  Candidate: H C Shi  Full Text: PDF
GTID: 1368330623958161  Subject: Signal and Information Processing
Abstract/Summary:
Image segmentation is a fundamental task in computer vision. It is widely used in many areas, such as autonomous driving, security monitoring, and face processing, and thus has important theoretical and practical value. With recent advances in artificial intelligence and deep learning, semantic scene parsing has become a new research focus in the image segmentation field. Unlike traditional foreground/background segmentation, semantic scene parsing must segment complex semantic objects out of a scene image, which raises many challenges. In complex scenes, the large number of objects, the diversity of object categories, the variability of object sizes, and the confusability of object semantics all restrict the development of semantic scene parsing. It is therefore urgent to develop effective semantic scene parsing methods for computer vision and artificial intelligence. Based on the above considerations, this thesis studies semantic scene parsing. We improve parsing accuracy from two aspects: optimizing feature extraction and optimizing classifier design. Moreover, the thesis further investigates multi-modal semantic scene parsing to improve multi-modal information encoding and cross-modal interaction. The detailed studies and contributions are summarized as follows:

(1) We first study scale-prediction-based semantic scene parsing to optimize feature extraction for each individual object. Because a complex scene contains many objects of different sizes, misclassification, under-segmentation, and over-segmentation occur easily. To address this problem, the thesis proposes a scale prediction model and a scale-prediction-based scene parsing model: the former predicts a suitable parsing scale for each object region, and the latter parses each object region at its predicted scale. The scale prediction model is fully supervised and is trained with automatically extracted scale labels, without any extra manual annotation. Meanwhile, we explore the relationship between scale and object category to further improve scale prediction accuracy. Furthermore, we propose an object-category-based scene parsing model, which leverages the object category to refine parsing results.

(2) To optimize feature extraction for multiple objects, we study global-local-relationship-based semantic scene parsing. We first propose a context model that learns multiple relationships, such as the co-occurrence between the global scene and each local object region, the co-occurrence among local object regions, and the relative positions among local regions. These relationships are then used as priors to improve object recognition accuracy. Moreover, scene classification is incorporated to provide more discriminative priors and further improve parsing accuracy.

(3) To distinguish objects with confusing appearance and semantics, we study category-focusing-based semantic scene parsing. We propose a category focusing model that leverages multiple classifiers to distinguish confusing objects gradually, in a human-like coarse-to-fine manner. Meanwhile, we explore using multiple binary classifiers instead of a single multinomial classifier to reduce error accumulation. Moreover, to avoid misclassifications caused by the similar scores of confusing objects, a variance-based regularization is proposed to separate the category scores as widely as possible, as sketched below.
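The abstract does not spell out the exact form of the variance-based regularization or of the binary-classifier heads, so the following is only a minimal PyTorch sketch of one plausible reading: one-vs-rest sigmoid classifiers per category, plus a penalty that rewards a large variance of the per-pixel score vector so that the scores of confusing categories are pushed apart. The names BinaryClassifierHead, variance_regularizer, parsing_loss, and the weight lambda_var are illustrative, not taken from the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryClassifierHead(nn.Module):
    """One-vs-rest heads: one 1x1-conv binary classifier per category."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # Each output channel is an independent binary (sigmoid) classifier.
        self.heads = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.heads(feats)          # logits: (B, C, H, W)

def variance_regularizer(logits: torch.Tensor) -> torch.Tensor:
    """Reward a large variance of the per-pixel score vector,
    i.e. push the scores of confusing categories apart."""
    scores = torch.sigmoid(logits)        # (B, C, H, W), each score in [0, 1]
    var = scores.var(dim=1)               # variance over the class dimension
    return (-var).mean()                  # minimizing this maximizes the variance

def parsing_loss(logits, target, num_classes, lambda_var=0.1):
    """Binary cross-entropy over one-hot targets plus the variance term."""
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    bce = F.binary_cross_entropy_with_logits(logits, one_hot)
    return bce + lambda_var * variance_regularizer(logits)

# Toy usage with hypothetical shapes.
head = BinaryClassifierHead(in_channels=256, num_classes=21)
feats = torch.randn(2, 256, 64, 64)
target = torch.randint(0, 21, (2, 64, 64))
loss = parsing_loss(head(feats), target, num_classes=21)
loss.backward()
```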
(4) Furthermore, we study adaptive-topology-based semantic scene parsing to optimize the topology among multiple classifiers. First, a densely-connected classifier model is proposed to avoid the problems of serial and parallel classifier models, such as error accumulation. Then, we propose an adaptive connection model that self-adaptively optimizes the topology among multiple classifiers and thus boosts parsing performance.

(5) We study key-word-based multi-modal semantic scene parsing to optimize information extraction from vision and language data. On the one hand, a key word extraction model is proposed to extract the key words of a language query and exclude linguistic noise. On the other hand, we propose a key-word-based visual context model to exploit the object relationships described in language queries and map those relationships onto the image.

(6) We study query-reconstruction-based multi-modal semantic scene parsing to optimize the interaction between vision and language. We propose a query reconstruction model that builds a bidirectional interaction between the two modalities: it first predicts a segmentation result from the input image and language query, and then reconstructs the query from the segmentation and the image. Through query reconstruction, the consistency between the segmentation and the query can be verified. For inconsistent segmentation/query pairs, we propose an iterative segmentation correction method to correct the segmentation and improve multi-modal parsing accuracy. A sketch of the consistency check follows.
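The abstract describes the segment-then-reconstruct loop but not its architecture, so the sketch below only illustrates the idea under stated assumptions: a black-box segmenter that maps an image and query tokens to a mask, and a black-box reconstructor that maps the image and mask back to per-word logits; the reconstruction loss then serves as the consistency signal. The class name QueryReconstructionConsistency and both module interfaces are hypothetical.

```python
import torch
import torch.nn as nn

class QueryReconstructionConsistency(nn.Module):
    """Segment from (image, query), reconstruct the query from (image, mask),
    and measure how well the reconstruction matches the original query."""
    def __init__(self, segmenter: nn.Module, reconstructor: nn.Module):
        super().__init__()
        self.segmenter = segmenter            # (image, tokens) -> mask, assumed
        self.reconstructor = reconstructor    # (image, mask) -> word logits, assumed
        self.recon_loss = nn.CrossEntropyLoss(reduction="none")

    def forward(self, image: torch.Tensor, query_tokens: torch.Tensor):
        mask = self.segmenter(image, query_tokens)       # (B, 1, H, W)
        word_logits = self.reconstructor(image, mask)    # (B, T, V)
        # Per-sample reconstruction loss: a large value signals that the
        # segmentation is inconsistent with the query.
        per_word = self.recon_loss(word_logits.transpose(1, 2), query_tokens)
        consistency = per_word.mean(dim=1)               # (B,)
        return mask, consistency
```

In this reading, samples whose consistency value stays high would be handed to the iterative correction step; the abstract does not detail that procedure, so it is not sketched here.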
Keywords/Search Tags: semantic scene parsing, image segmentation, deep learning, feature learning, classifier, multi-modal