Research On Scene Understanding Algorithms Based On Graph Neural Networks

Posted on: 2021-03-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: A Luo
Full Text: PDF
GTID: 1368330647960722
Subject: Control Science and Engineering
Abstract/Summary:
With the rapid development of deep learning, computer vision technologies based on convolutional neural networks (CNNs) play an indispensable role in scene understanding. As the features at different stages of a CNN have different characteristics, feature fusion becomes a necessary means of improving the capability of feature representation. However, in certain tasks, directly fusing deep and shallow features leads to mutual interference. Directly fusing cross-modal features likewise leads to information interference. Therefore, how multi-modal features are fused is very important in complex environments. This dissertation focuses on how to use graph neural networks to establish higher-order models among complex features, so as to perform relational inference and information interaction. In this way, the network can avoid the negative impacts of directly fusing multi-modal or complicated features. The contributions of this dissertation are as follows:

In the task of traffic scene density perception, two typical problems hinder precise prediction. First, the density response in the ground truth extends beyond the crowd area. Second, the response distribution in the ground truth has a large variance in some complex scenes. To alleviate these problems, this dissertation proposes a hybrid graph neural network that formulates crowd density prediction and localization as a joint reasoning model. This method is the first deep neural network that can explicitly learn and reason about the higher-level relations between crowd counting and its auxiliary task (localization) across different scales through a hybrid graph model. Moreover, the model is multi-task in nature: its different types of nodes, linking edges and information transmission functions are each implemented by specific neural network modules. By exploiting this characteristic, the model can make precise use of the collaborative and complementary information between crowd counting and localization so as to improve perception accuracy.

In the task of scene parsing, existing algorithms focus only on image-specific feature representations while ignoring dataset-level general semantic knowledge. Consequently, they are unable to address the uncertainty caused by illumination variance and occlusion. To alleviate this problem, this dissertation proposes a knowledge-augmented neural network that obtains supportive semantic knowledge from the whole dataset. Specifically, it employs a graph convolutional neural network to reason about the relations between category-specific features and their co-occurrence representations. During inference, the proposed efficient knowledge augmentation operator converts the learned dataset-level supportive semantic knowledge into image-specific features, so as to enhance the representation of basic features and improve scene parsing performance.

In addition, to ensure the efficiency of the scene parsing network, this dissertation proposes an efficient dual feature abstraction module for building backbones. This module adopts two parallel convolutional branches: one models spatial relations using depth-wise convolution, while the other models cross-channel correlation using point-wise convolution, which reduces the number of parameters and the computational complexity of the whole network. A lightweight coding enhancement module is then designed within the encoder-decoder architecture to ensure that the extracted features contain both high-level semantic knowledge and spatial details. Finally, the efficient graph-based relation inference module is appended on top of the network to achieve a good balance between efficiency and performance.

To facilitate the cross-domain fusion of appearance and depth features in anthropomorphic salient object detection tasks, this dissertation proposes a cascade graph neural network in which a set of graphs is used to establish the higher-order relations among cross-modal features. The cascade graph model consists of several hierarchical graph structures that handle cross-modal information reasoning and interaction at each specific stage. The graph model of each stage converts the output of the previous graph into domain-specific guiding nodes that guide feature learning in the current graph model, thereby connecting the graph models of multiple stages sequentially and transferring semantic information between them. Experimental results on several public datasets show that the proposed cascade graph neural network achieves higher accuracy than existing state-of-the-art methods.
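
The parameter saving claimed for the dual feature abstraction module can be illustrated with a rough weight count. The sketch below is not taken from the dissertation: the helper names are hypothetical, bias terms are ignored, and both branches are assumed to read the same input and keep the channel count unchanged, since the abstract does not say how the branch outputs are combined. It compares one standard k x k convolution against a depth-wise branch plus a point-wise branch.

```python
# Illustrative weight counts only; all function names are hypothetical
# and bias terms are ignored.

def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_conv_params(channels: int, k: int) -> int:
    """Weights in a k x k depth-wise convolution (one filter per channel)."""
    return k * k * channels


def pointwise_conv_params(c_in: int, c_out: int) -> int:
    """Weights in a 1 x 1 (point-wise) convolution."""
    return c_in * c_out


def dual_branch_params(channels: int, k: int) -> int:
    """Depth-wise branch plus point-wise branch, run in parallel.

    Assumption: both branches keep the channel count unchanged.
    """
    return depthwise_conv_params(channels, k) + pointwise_conv_params(channels, channels)


if __name__ == "__main__":
    c, k = 64, 3
    print(standard_conv_params(c, c, k))  # 36864
    print(dual_branch_params(c, k))       # 4672
```

Under these assumptions, with 64 channels and a 3 x 3 kernel the two parallel branches together hold roughly one eighth of the weights of a single standard convolution. The actual ratio in the dissertation depends on design details that the abstract does not give.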
Keywords/Search Tags: Graph neural network, scene understanding, crowd counting, semantic segmentation, salient object detection