In recent years, referring expression segmentation (RES) has become an important research direction at the intersection of computer vision and natural language processing. Given an image and an expression describing an object instance in the image, RES aims to segment the mask region of the corresponding entity. There are two main architectures in RES: single-stage and multi-stage models. A multi-stage model typically uses an object detection model to propose candidate bounding boxes and scores the correlation between each candidate box and the expression to determine the segmentation target. This scheme is limited by the accuracy of the detector, and multi-stage models also suffer from low inference efficiency and a lack of real-time performance. A single-stage model usually adopts an image semantic segmentation model as the backbone network and realizes cross-modal interaction by adding a text encoder and a cross-modal fusion module to complete the segmentation task. Existing single-stage models still fail to solve the following problems well: 1) how to better fuse, in the decoder stage, the multi-scale features produced by the encoder, so as to provide richer feature information for the final segmentation module; 2) how to better overcome the difference in feature distribution between modalities and achieve effective cross-modal feature fusion. To address these two problems, this paper conducts the following research: (1) For problem 1, this paper proposes a referring expression segmentation method based on deeply supervised fusion and feature smoothing. The method coordinates and extracts global and local information across features of different granularities through a deeply supervised multi-scale feature fusion module. Then, guided by textual features, each fine-grained feature predicts its own segmentation heatmap, and the method predicts the final segmentation mask
result from these heatmaps. A feature smoothing loss function reduces the fine-grained discrepancies introduced during multi-scale feature fusion and upsampling, further optimizing the predicted segmentation. (2) For problems 1 and 2, this paper proposes a referring expression segmentation method based on a U-shaped Transformer structure. The method designs a cross-modal attention mechanism and applies it in the encoder, moving cross-modal interaction forward to the encoding stage and reducing the feature distribution gap between modalities. Second, the method connects the encoder and decoder through skip connections, so that shallow encoder features provide detailed information to the decoder. Finally, a patch prediction module is added to supply auxiliary information for mask prediction and improve segmentation accuracy. The proposed methods are evaluated on the RefCOCO, RefCOCO+, and RefCOCOg datasets. The experimental results show that our method outperforms other mainstream models on all evaluation metrics, which verifies the effectiveness of the proposed methods.
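The abstract does not give the exact form of the feature smoothing loss, so the following is a minimal sketch of the idea under one plausible assumption: the loss penalises the mean squared disagreement between each coarse-scale heatmap (upsampled to the finest resolution) and the finest-scale heatmap, which is one way to reduce fine-grained differences across scales. The function names and the nearest-neighbour upsampling choice are illustrative, not taken from the paper.

```python
def upsample_nearest(hm, factor):
    """Nearest-neighbour upsampling of a 2D heatmap (list of lists)."""
    return [[hm[i // factor][j // factor]
             for j in range(len(hm[0]) * factor)]
            for i in range(len(hm) * factor)]

def smoothing_loss(heatmaps):
    """Assumed smoothing objective: mean squared difference between each
    coarser heatmap (upsampled to the finest resolution) and the finest
    heatmap, penalising cross-scale disagreement.  `heatmaps` is ordered
    coarse -> fine, with square maps whose sizes divide the finest size."""
    finest = heatmaps[-1]
    h, w = len(finest), len(finest[0])
    total, count = 0.0, 0
    for hm in heatmaps[:-1]:
        up = upsample_nearest(hm, h // len(hm))
        for i in range(h):
            for j in range(w):
                total += (up[i][j] - finest[i][j]) ** 2
                count += 1
    return total / count
```

In this reading, a coarse heatmap that already agrees with the finest prediction contributes zero loss, so the term only activates when the scales disagree.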
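To illustrate what "moving cross-modal interaction forward to the encoding stage" can mean in practice, here is a deliberately simplified sketch of cross-modal attention in which each image token queries the text tokens. It omits the learned query/key/value projections, multi-head structure, and normalisation that a real Transformer encoder would use; it is a sketch of the mechanism's shape, not the paper's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_modal_attention(img_tokens, txt_tokens):
    """Each image token (query) attends over all text tokens (keys/values),
    mixing linguistic context into the visual features during encoding.
    Tokens are plain lists of floats; projections are omitted for brevity."""
    d = len(txt_tokens[0])
    out = []
    for q in img_tokens:
        # Scaled dot-product scores between one image token and every text token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in txt_tokens]
        weights = softmax(scores)
        # Weighted sum of text tokens becomes the language-aware visual feature.
        out.append([sum(w * v[i] for w, v in zip(weights, txt_tokens))
                    for i in range(d)])
    return out
```

With a single text token, every image token's output collapses to that token, which makes the mixing behaviour easy to verify by hand.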