
Cross-modal Deep Mutual-learning For Referring Image Segmentation

Posted on: 2022-05-29    Degree: Master    Type: Thesis
Country: China    Candidate: Z W Hu    Full Text: PDF
GTID: 2518306509993019    Subject: Electronics and Communications Engineering
Abstract/Summary:
Referring Image Segmentation (RIS) has in recent years become an important research direction at the intersection of natural language processing and computer vision. Given an image and a referring expression, RIS aims to segment the image entity described by the natural language expression. As a representative multi-modal task, RIS is applicable to a wide range of scenarios, e.g., interactive image editing, intelligent visual search, language-based robot control, and human-robot interaction. Although RIS methods based on Convolutional Neural Networks have made significant progress, many challenges remain. In particular, most existing methods do not explicitly formulate the mutual guidance relationship between visual and linguistic features, which makes it difficult for the model to learn the consistency between multi-modal features. To address this problem, this thesis proposes two RIS algorithms based on cross-modal deep mutual-learning.

The first algorithm is a novel bi-directional relationship inferring network that captures the dependencies between cross-modal information. It employs a bi-directional cross-modal attention module to model the relationship between multi-modal features, together with a bi-directional guidance mechanism between vision and language that strengthens the consistency of cross-modal relationships, thereby alleviating the alignment problem between high-dimensional language features and low-dimensional visual features. Moreover, a gated bi-directional fusion module is designed to integrate multi-level features; it uses a gate function to guide the bi-directional flow of multi-level information, which effectively enriches the detail of the final segmentation result.

The second algorithm constructs a Transformer-based referring image segmentation network and, for the first time, applies the Transformer architecture to RIS. It uses a Transformer-based multi-modal feature encoder-fusion unit to learn the mutual guidance relationship between multi-modal features, which helps the network understand more complex language descriptions. Because this cross-modal encoder-fusion scheme places the cross-modal deep mutual-learning process in the feature encoding stage, it significantly improves the multi-modal reasoning ability of the network compared with the first algorithm. To further enhance the detail of the segmentation results, a Transformer-based multi-level feature fusion unit is proposed to fuse multi-level features and improve segmentation accuracy.

Both algorithms achieve excellent results on several benchmark datasets.
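To make the bi-directional mutual guidance idea concrete, the following is a minimal sketch of a bi-directional cross-modal attention block with gated residual fusion. It is not the implementation described in the thesis: it assumes PyTorch, a shared hidden dimension for both modalities, and standard multi-head attention, and all class, method, and variable names (BiDirectionalCrossModalAttention, vis_from_lang, lang_from_vis, etc.) are illustrative.

# Minimal sketch (assumptions: PyTorch, shared hidden dimension `dim`,
# standard multi-head attention); names are illustrative, not the thesis code.
import torch
import torch.nn as nn


class BiDirectionalCrossModalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Vision queries attend over language keys/values, and vice versa.
        self.vis_from_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lang_from_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gates control how much cross-modal context is mixed back into each modality.
        self.vis_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.lang_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, lang: torch.Tensor):
        # vis:  (B, H*W, dim) flattened visual feature map
        # lang: (B, T, dim)   word-level language features
        vis_ctx, _ = self.vis_from_lang(query=vis, key=lang, value=lang)
        lang_ctx, _ = self.lang_from_vis(query=lang, key=vis, value=vis)
        # Gated residual fusion keeps each modality's own features while
        # injecting the other modality's guidance.
        g_v = self.vis_gate(torch.cat([vis, vis_ctx], dim=-1))
        g_l = self.lang_gate(torch.cat([lang, lang_ctx], dim=-1))
        return vis + g_v * vis_ctx, lang + g_l * lang_ctx


# Usage example with toy shapes.
if __name__ == "__main__":
    block = BiDirectionalCrossModalAttention(dim=256)
    vis = torch.randn(2, 26 * 26, 256)   # e.g. a 26x26 feature map
    lang = torch.randn(2, 15, 256)       # e.g. a 15-word expression
    v, l = block(vis, lang)
    print(v.shape, l.shape)              # (2, 676, 256) and (2, 15, 256)

The same pattern, stacked inside the encoder rather than applied to already-encoded features, corresponds to the encoder-fusion idea of the second algorithm.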
Keywords/Search Tags:Referring Image Segmentation, Convolutional Neural Network, Bidirectional Relationship Inferring Network, Transformer