In recent years, referring expression segmentation (RES) has become an important research direction at the intersection of computer vision and natural language processing. Given an image and an expression describing an object instance in the image, RES aims to segment the mask region of the corresponding entity. There are two main architectures in RES: single-stage and multi-stage models. A multi-stage model typically uses an object detection model to propose candidate bounding boxes and scores the correlation between each candidate box and the expression to determine the segmentation target. This scheme is limited by the accuracy of the detector, and multi-stage models also suffer from low inference efficiency and a lack of real-time performance. A single-stage model usually adopts an image semantic segmentation model as the backbone network and realizes cross-modal interaction by adding a text encoder and a cross-modal fusion module to complete the segmentation task. Existing single-stage models still fail to solve the following problems well: 1) how to better fuse, in the decoder stage, the multi-scale features produced by the encoder, so as to provide richer feature information for the final segmentation module; 2) how to better overcome the difference in feature distribution between modalities and achieve effective cross-modal feature fusion. To address these two problems, this paper conducts the following research: (1) For problem 1, this paper proposes a referring expression segmentation method based on deeply supervised fusion and feature smoothing. The method coordinates and extracts global and local information across features of different granularities through a deeply supervised multi-scale feature fusion module. Then, guided by textual features, each fine-grained feature predicts its own segmentation heatmap, and the method predicts the final segmentation mask
result from these heatmaps. A feature smoothing loss function reduces the fine-grained discrepancies introduced during multi-scale feature fusion and upsampling, further optimizing the predicted segmentation. (2) For problems 1 and 2, this paper proposes a referring expression segmentation method based on a U-shaped Transformer structure. The method designs a cross-modal attention mechanism and applies it in the encoder, moving cross-modal interaction forward to the encoding stage and reducing the feature distribution gap between modalities. Second, the method connects the encoder and decoder through skip connections, so that shallow encoder features provide detailed information to the decoder. Finally, a patch prediction module is added to supply auxiliary information for mask prediction and improve segmentation accuracy. The proposed methods are evaluated on the RefCOCO, RefCOCO+, and RefCOCOg datasets. The experimental results show that our method outperforms other mainstream models on all evaluation metrics, which verifies the effectiveness of the proposed methods.
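The abstract does not give the exact form of the feature smoothing loss, so the following is a minimal sketch of the idea under one plausible assumption: the loss penalises the mean squared disagreement between each coarse-scale heatmap (upsampled to the finest resolution) and the finest-scale heatmap, which is one way to reduce fine-grained differences across scales. The function names and the nearest-neighbour upsampling choice are illustrative, not taken from the paper.

```python
def upsample_nearest(hm, factor):
    """Nearest-neighbour upsampling of a 2D heatmap (list of lists)."""
    return [[hm[i // factor][j // factor]
             for j in range(len(hm[0]) * factor)]
            for i in range(len(hm) * factor)]

def smoothing_loss(heatmaps):
    """Assumed smoothing objective: mean squared difference between each
    coarser heatmap (upsampled to the finest resolution) and the finest
    heatmap, penalising cross-scale disagreement.  `heatmaps` is ordered
    coarse -> fine, with square maps whose sizes divide the finest size."""
    finest = heatmaps[-1]
    h, w = len(finest), len(finest[0])
    total, count = 0.0, 0
    for hm in heatmaps[:-1]:
        up = upsample_nearest(hm, h // len(hm))
        for i in range(h):
            for j in range(w):
                total += (up[i][j] - finest[i][j]) ** 2
                count += 1
    return total / count
```

In this reading, a coarse heatmap that already agrees with the finest prediction contributes zero loss, so the term only activates when the scales disagree.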
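To illustrate what "moving cross-modal interaction forward to the encoding stage" can mean in practice, here is a deliberately simplified sketch of cross-modal attention in which each image token queries the text tokens. It omits the learned query/key/value projections, multi-head structure, and normalisation that a real Transformer encoder would use; it is a sketch of the mechanism's shape, not the paper's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_modal_attention(img_tokens, txt_tokens):
    """Each image token (query) attends over all text tokens (keys/values),
    mixing linguistic context into the visual features during encoding.
    Tokens are plain lists of floats; projections are omitted for brevity."""
    d = len(txt_tokens[0])
    out = []
    for q in img_tokens:
        # Scaled dot-product scores between one image token and every text token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in txt_tokens]
        weights = softmax(scores)
        # Weighted sum of text tokens becomes the language-aware visual feature.
        out.append([sum(w * v[i] for w, v in zip(weights, txt_tokens))
                    for i in range(d)])
    return out
```

With a single text token, every image token's output collapses to that token, which makes the mixing behaviour easy to verify by hand.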