
Research On Image Multi-Granularity Referring Analysis

Posted on: 2022-05-05    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S Qiu    Full Text: PDF
GTID: 1488306560993399    Subject: Signal and Information Processing
Abstract/Summary:
Image referring analysis combines research in computer vision and natural language processing and is an important problem in artificial intelligence. It has great application potential in human-computer interaction fields such as smart homes, autonomous driving, and security monitoring. Image referring object analysis originates from the close relationship between language and vision in daily life. Since people usually attend to only part of an image, they can describe specific attributes to refer to particular objects, and the regions of interest can then be obtained by object-level referring analysis. In applications with higher requirements for scene understanding, it is further necessary to refine the object bounding boxes to the object contours and achieve pixel-level referring analysis. Facing the complexity of language and the diversity of visual scenes in practical applications, correctly understanding the visual and textual information and accurately analyzing the object referred to by a natural-language expression is a key problem to be solved. Considering that existing referring analysis techniques are not accurate enough in object localization and at semantic boundary regions, this thesis studies referring analysis based on deep neural networks to achieve multi-granularity referring analysis from the object level to the pixel level. The main contributions of this thesis are summarized as follows:

(1) For object-level referring analysis, we propose an end-to-end referring expression comprehension method based on global information embedding. To address the inaccurate object localization of existing one-stage methods, and considering the relative position relationships between the objects mentioned in the description, we establish an object-center localization branch that uses global context information to localize the object coarsely. A cross-modal semantic consistency filter is also constructed from the semantic similarity between the text and the image regions to gradually reduce the candidate regions, improving the accuracy of referring detection. In addition, the end-to-end framework avoids the time consumed by generating or storing candidate boxes and greatly shortens the analysis time. Experimental results show that the proposed method based on global information embedding ensures both the speed and the accuracy of referring object analysis; for example, compared with the baseline method, it raises Prec@0.5 on the testB split of the UNC dataset from 67.59% to 71.42%.
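To make the two components of (1) concrete, the following is a minimal PyTorch sketch of a center-localization head driven by fused visual-language features and a cross-modal semantic consistency filter that keeps the candidate regions most similar to the sentence embedding. All module names, tensor shapes, and the keep ratio are illustrative assumptions, not the implementation used in the thesis.

```python
# Minimal sketch of a center-localization branch and a cross-modal
# semantic consistency filter; names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterBranch(nn.Module):
    """Predicts a coarse object-center heatmap from fused visual-language features."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // 2, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // 2, 1, 1),
        )

    def forward(self, fused_feat: torch.Tensor) -> torch.Tensor:
        # fused_feat: (B, C, H, W) -> center heatmap (B, 1, H, W) in [0, 1]
        return torch.sigmoid(self.head(fused_feat))

def semantic_consistency_filter(region_feats, sent_embed, keep_ratio=0.5):
    """Keep the candidate regions whose features are most similar to the sentence.

    region_feats: (B, N, D) pooled features of N candidate regions
    sent_embed:   (B, D)    sentence-level text embedding
    """
    sim = F.cosine_similarity(region_feats, sent_embed.unsqueeze(1), dim=-1)  # (B, N)
    k = max(1, int(region_feats.size(1) * keep_ratio))
    keep_idx = sim.topk(k, dim=1).indices                                     # (B, k)
    return keep_idx, sim

# Toy usage with random tensors.
fused = torch.randn(2, 256, 20, 20)
heatmap = CenterBranch(256)(fused)                    # coarse localization prior
regions = torch.randn(2, 100, 512)
sentence = torch.randn(2, 512)
kept, scores = semantic_consistency_filter(regions, sentence)
print(heatmap.shape, kept.shape)
```

In practice the filter would be applied repeatedly at decreasing keep ratios, which is how the gradual reduction of candidate regions described above could be realized.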
(2) For pixel-level referring analysis, we propose a novel image referring segmentation method based on generative adversarial learning. To address the difficulty of segmenting fine target details, a generative adversarial network is designed to constrain the consistency between the distribution of the ground-truth masks and that of the predictions. The adversarial loss is used as an additional measure supplementing the traditional segmentation loss (sketched below), so that the predictions follow a distribution similar to the real results; in this way, the semantic correlation between pixels in the image helps the network identify hard-to-delineate objects and optimize the segmentation results. In addition, detail enhancement and semantic embedding structures are proposed: fusing the multi-resolution shallow features of the network enhances its ability to represent object contours and other details, while fusing the visual features with the predicted confidence map serves as semantic information embedding to eliminate semantic ambiguity. Experimental results on four public datasets show that the method based on the generative adversarial network effectively improves the accuracy of referring segmentation; for example, compared with the baseline method, it raises the Overall IoU on the Google-Ref dataset from 36.92% to 41.36%.

(3) For pixel-level referring analysis, we propose an image referring segmentation network with language-guided gated multi-scale feature fusion. To address the scale differences between referring objects, we observe that different levels of features in a deep neural network play different roles in object representation and design a fusion network based on a gate mechanism. The network uses gate functions to filter and fuse features at different scales, makes full use of the complementarity of multi-scale information across feature levels, and obtains features that carry both low-level detail and high-level semantics. This addresses the object-scale problem from the perspective of feature fusion and optimizes the segmentation results. In addition, considering the important guiding role of the referring expression in filtering scale information, a language-guided gate mechanism (sketched below) is proposed to ensure effective fusion of multi-scale features and further improve performance. Compared with state-of-the-art methods, the proposed method achieves better performance on three datasets, reaching an Overall IoU of 50.43% on the UNC+ dataset.

(4) For pixel-level referring analysis, we propose a referring segmentation method based on multi-level multi-modal information fusion to address the modality gap between visual and textual data. Because text is comparatively easy to understand and parse, its semantic information is used to guide the parsing of the image content. The method fuses text and image features at both the local (word) level and the global (sentence) level: a ConvLSTM network with an attention mechanism fuses the visual information word by word, and a graph attention network (sketched below) performs global-level analysis, modeling and reasoning over the relationships between pixels in the multi-modal features based on the textual context, which provides more effective information for subsequent decoding. The proposed image referring segmentation method based on multi-level multi-modal fusion achieves better performance than state-of-the-art methods, reaching an Overall IoU of 62.47% on the UNC dataset.
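For the adversarial supervision described in (2), the following is a minimal PyTorch sketch in which a discriminator judges whether a mask comes from the ground truth or from the predictor, and its loss supplements the per-pixel segmentation loss. The discriminator architecture, the weighting factor lambda_adv, and the unconditional (mask-only) discriminator input are simplifying assumptions, not the thesis design.

```python
# Minimal sketch of segmentation loss + adversarial loss; the discriminator
# and lambda_adv are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskDiscriminator(nn.Module):
    """Scores how 'real' a mask looks; conditioning on the image is omitted for brevity."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, mask):
        return self.net(mask)  # (B, 1) real/fake logit

def generator_loss(pred_logits, gt_mask, disc, lambda_adv=0.1):
    """Per-pixel loss plus an adversarial term that pushes the predicted
    mask distribution toward the ground-truth distribution."""
    seg_loss = F.binary_cross_entropy_with_logits(pred_logits, gt_mask)
    fake_score = disc(torch.sigmoid(pred_logits))
    adv_loss = F.binary_cross_entropy_with_logits(fake_score, torch.ones_like(fake_score))
    return seg_loss + lambda_adv * adv_loss

def discriminator_loss(pred_logits, gt_mask, disc):
    real = disc(gt_mask)
    fake = disc(torch.sigmoid(pred_logits).detach())
    return 0.5 * (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
                  + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

# Toy usage with random tensors.
disc = MaskDiscriminator()
pred = torch.randn(2, 1, 64, 64)                         # segmentation logits
gt = torch.randint(0, 2, (2, 1, 64, 64)).float()         # ground-truth mask
print(generator_loss(pred, gt, disc).item(), discriminator_loss(pred, gt, disc).item())
```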
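For the language-guided gating in (3), the following is a minimal PyTorch sketch in which the sentence embedding predicts one gate per feature level and the gated levels are fused at a common resolution. The scalar per-level gate and the fusion at the finest resolution are illustrative simplifications of the described mechanism, not the thesis architecture.

```python
# Minimal sketch of language-guided gated multi-scale fusion; shapes and the
# scalar-gate formulation are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGatedFusion(nn.Module):
    def __init__(self, level_channels, out_channels, text_dim):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in level_channels)
        # one scalar gate per feature level, predicted from the sentence embedding
        self.gate = nn.Linear(text_dim, len(level_channels))

    def forward(self, feats, sent_embed):
        # feats: list of (B, C_l, H_l, W_l); sent_embed: (B, text_dim)
        target_size = feats[0].shape[-2:]                 # fuse at the finest resolution
        gates = torch.sigmoid(self.gate(sent_embed))      # (B, L), values in [0, 1]
        fused = 0
        for l, (f, p) in enumerate(zip(feats, self.proj)):
            f = F.interpolate(p(f), size=target_size, mode='bilinear', align_corners=False)
            fused = fused + gates[:, l].view(-1, 1, 1, 1) * f
        return fused                                      # (B, out_channels, H_0, W_0)

# Toy usage: three backbone levels and one sentence embedding.
feats = [torch.randn(2, 256, 80, 80), torch.randn(2, 512, 40, 40), torch.randn(2, 1024, 20, 20)]
sent = torch.randn(2, 300)
fusion = LanguageGatedFusion([256, 512, 1024], 256, 300)
print(fusion(feats, sent).shape)                          # torch.Size([2, 256, 80, 80])
```

The gate values depend only on the expression, which is one simple way the language could decide how much each scale contributes to the fused representation.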
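For the global-level reasoning in (4), the following is a minimal PyTorch sketch that treats each spatial position of the multi-modal feature map as a graph node and aggregates information over all nodes with attention weights. This single-head formulation stands in for the graph attention network described above and omits the word-level ConvLSTM fusion; it is an illustration under those assumptions, not the thesis architecture.

```python
# Minimal sketch of attention-based reasoning over pixel nodes of a
# multi-modal feature map; a stand-in for the graph attention network.
import torch
import torch.nn as nn

class PixelGraphAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 2, 1)
        self.key = nn.Conv2d(channels, channels // 2, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        # x: (B, C, H, W) multi-modal (visual + textual) feature map
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)              # (B, HW, C/2)
        k = self.key(x).flatten(2)                                # (B, C/2, HW)
        v = self.value(x).flatten(2).transpose(1, 2)              # (B, HW, C)
        attn = torch.softmax(q @ k / (q.size(-1) ** 0.5), dim=-1) # (B, HW, HW) edge weights
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                            # residual keeps local detail

# Toy usage with a random multi-modal feature map.
x = torch.randn(2, 256, 20, 20)
print(PixelGraphAttention(256)(x).shape)                          # torch.Size([2, 256, 20, 20])
```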
Keywords/Search Tags:Deep Learning, Semantic Segmentation, Object Detection, Referring Object Analysis