
Cross-modal Deep Mutual-learning For Referring Image Segmentation

Posted on: 2022-05-29    Degree: Master    Type: Thesis
Country: China    Candidate: Z W Hu    Full Text: PDF
GTID: 2518306509993019    Subject: Electronics and Communications Engineering
Abstract/Summary:
Referring Image Segmentation (RIS) has in recent years become an important research direction at the intersection of natural language processing and computer vision. Given an image and a referring expression, RIS aims to segment the image entity described by the natural language expression. As a representative multi-modal task, RIS is applicable to a wide range of scenarios, e.g., interactive image editing, intelligent visual search, language-based robot control, and human-robot interaction. Although RIS methods based on Convolutional Neural Networks have made significant progress, many challenges remain. In particular, most existing methods do not explicitly formulate the mutual guidance relationship between visual and linguistic features, which makes it difficult for the model to learn the consistency between multi-modal features. To address this problem, this thesis proposes two RIS algorithms based on cross-modal deep mutual-learning.

The first algorithm is a novel bi-directional relationship inferring network that captures the dependencies between cross-modal information. It employs a bi-directional cross-modal attention module to model the relationship between multi-modal features, together with a bi-directional guidance mechanism between vision and language that strengthens the consistency of cross-modal relationships, thereby alleviating the alignment problem between high-dimensional language features and low-dimensional visual features. Moreover, a gated bi-directional fusion module is designed to integrate multi-level features; it uses a gate function to guide the bi-directional flow of multi-level information, which effectively enriches the detail of the final segmentation result.

The second algorithm constructs a Transformer-based referring image segmentation network and, for the first time, applies the Transformer architecture to RIS. It uses a Transformer-based multi-modal feature encoder-fusion unit to learn the mutual guidance relationship between multi-modal features, which helps the network understand more complex language descriptions. Because this cross-modal encoder-fusion scheme places the cross-modal deep mutual-learning process in the feature encoding stage, it significantly improves the multi-modal reasoning ability of the network compared with the first algorithm. To further enhance the detail of the segmentation results, a Transformer-based multi-level feature fusion unit is proposed to fuse multi-level features and improve segmentation accuracy.

Both algorithms achieve excellent results on several benchmark datasets.
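To make the bi-directional mutual guidance idea concrete, the following is a minimal sketch of a bi-directional cross-modal attention block with gated residual fusion. It is not the implementation described in the thesis: it assumes PyTorch, a shared hidden dimension for both modalities, and standard multi-head attention, and all class, method, and variable names (BiDirectionalCrossModalAttention, vis_from_lang, lang_from_vis, etc.) are illustrative.

# Minimal sketch (assumptions: PyTorch, shared hidden dimension `dim`,
# standard multi-head attention); names are illustrative, not the thesis code.
import torch
import torch.nn as nn


class BiDirectionalCrossModalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Vision queries attend over language keys/values, and vice versa.
        self.vis_from_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lang_from_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gates control how much cross-modal context is mixed back into each modality.
        self.vis_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.lang_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, lang: torch.Tensor):
        # vis:  (B, H*W, dim) flattened visual feature map
        # lang: (B, T, dim)   word-level language features
        vis_ctx, _ = self.vis_from_lang(query=vis, key=lang, value=lang)
        lang_ctx, _ = self.lang_from_vis(query=lang, key=vis, value=vis)
        # Gated residual fusion keeps each modality's own features while
        # injecting the other modality's guidance.
        g_v = self.vis_gate(torch.cat([vis, vis_ctx], dim=-1))
        g_l = self.lang_gate(torch.cat([lang, lang_ctx], dim=-1))
        return vis + g_v * vis_ctx, lang + g_l * lang_ctx


# Usage example with toy shapes.
if __name__ == "__main__":
    block = BiDirectionalCrossModalAttention(dim=256)
    vis = torch.randn(2, 26 * 26, 256)   # e.g. a 26x26 feature map
    lang = torch.randn(2, 15, 256)       # e.g. a 15-word expression
    v, l = block(vis, lang)
    print(v.shape, l.shape)              # (2, 676, 256) and (2, 15, 256)

The same pattern, stacked inside the encoder rather than applied to already-encoded features, corresponds to the encoder-fusion idea of the second algorithm.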
Keywords/Search Tags:Referring Image Segmentation, Convolutional Neural Network, Bidirectional Relationship Inferring Network, Transformer