Referring segmentation aims to segment the visual regions of an image or video that correspond to a referring expression. Compared with traditional semantic or instance segmentation, it can segment any region described by free-form language, without relying on predefined semantic or object classes, and can therefore handle ambiguity in segmentation more flexibly. As a fundamental technique at the intersection of computer vision and natural language processing, it is widely used in human-computer interaction, intelligent question answering, robotics, and other applications. With the rapid development of artificial intelligence, referring segmentation has attracted increasing attention and has become one of the important topics in cross-modal information research. Although great progress has been made, the task still faces many challenges, for example, how to promote the deep interweaving of visual and linguistic information in complex scenes, and how to achieve high-performance referring segmentation from weak annotations. To address these difficulties, this thesis studies referring image segmentation and referring video segmentation based on deep learning. The main contents and innovations of this thesis are as follows.

First, a referring image segmentation algorithm based on encoder fusion is proposed to strengthen the correlation between multi-modal features at different scales. It uses a co-attention mechanism to embed language features into the visual encoder, transforming it into a multi-modal feature encoder in which language progressively guides the multi-modal features at every scale. The co-attention learns a co-projection of the multi-modal features so that they exhibit better semantic consistency in the new feature subspace, emphasizing the guiding role of language (a sketch of this mechanism is given below). In the decoding stage, a boundary enhancement module is designed to strengthen the network's attention to the target boundary, enabling it to recover a more complete foreground region. Experimental results show that the model significantly improves segmentation accuracy.

Second, exploiting the correlation between referring image segmentation and referring localization, a multi-task network is proposed. It constructs a bidirectional cross-modal attention module that first uses visual guidance to learn a pixel-wise adaptive linguistic context and then uses the learned linguistic context to guide the update of the visual features; through this interaction, information from the two modalities is mutually embedded (see the second sketch below). In addition, the local details contained in the low-level features are integrated into the high-level global features through a bottom-up feature fusion branch, in which the segmentation prediction map serves as a gate that controls message passing, making the network focus more on the boundary details of the referred region. Experimental analysis shows that the model reaches an advanced level in both performance and speed.

Third, to reduce the labor cost of pixel-level annotation, a weakly supervised referring image segmentation algorithm based on bounding-box annotation is proposed. It first designs an adversarial boundary loss that predicts the contour of the referred target under the supervision of the bounding box. The predicted contour is then used to filter the region proposals generated by an unsupervised algorithm, constructing pseudo labels. To weaken the influence of noisy labels when training the segmentation network, two networks mutually select high-confidence labels for each other; because they filter the noise in the pseudo labels from different perspectives, overfitting is alleviated to a certain extent and the performance of the segmentation network improves (a simplified version of this mutual selection is sketched below).

Finally, to exploit the multi-level semantic context of text, a deeply interleaved two-stream encoding network is proposed. It adopts multiple cascaded Transformer modules to extract multi-level linguistic context and repeatedly inserts visual-language mutual-guidance modules between the linguistic encoder and the visual encoder. These steps promote the progressive interaction of multi-level information between the two encoders from shallow to deep layers and realize the deep interleaving of multi-modal features. To enhance the temporal consistency of multi-modal information in video sequences, a language-guided multi-scale dynamic filtering module is introduced: it uses the language-guided spatio-temporal context to learn a set of position-adaptive multi-scale dynamic filters and applies them to update the features of the current frame (a minimal sketch appears below). Experimental results show that the model effectively improves referring video segmentation through inter-modal and inter-frame information fusion.
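
To make the co-attention of the first contribution concrete, the following minimal PyTorch sketch shows one plausible form: both modalities are co-projected into a shared subspace, each pixel attends over the words, and the attended language context is fused back into the visual features before the next encoder stage. The module name, tensor shapes, and the multiplicative fusion step are illustrative assumptions, not the thesis's exact implementation.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Sketch: co-project visual and language features into a shared
    subspace, attend from pixels to words, and fuse the attended
    language context back into the visual stream."""

    def __init__(self, dim: int):
        super().__init__()
        self.vis_proj = nn.Linear(dim, dim)   # co-projection of visual features
        self.lang_proj = nn.Linear(dim, dim)  # co-projection of language features
        self.out = nn.Linear(dim, dim)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis:  (B, HW, C) flattened feature map from one encoder stage
        # lang: (B, T, C)  word-level language features
        q = self.vis_proj(vis)                                   # (B, HW, C)
        k = self.lang_proj(lang)                                 # (B, T, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        lang_ctx = attn @ k                                      # per-pixel language context
        return self.out(vis * lang_ctx)                          # multi-modal features for the next stage

# Usage with hypothetical sizes: one module per encoder scale.
coattn = CoAttention(dim=256)
fused = coattn(torch.randn(2, 64 * 64, 256), torch.randn(2, 12, 256))
```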
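The second contribution's bidirectional interaction and prediction-gated fusion can be sketched in the same style. Both functions below are assumptions about shapes and wiring: the attention module follows the two steps named in the abstract (visual guidance learns a per-pixel language context, which then gates the visual update), and the fusion helper uses the sigmoid of the prediction logits as the gate on low-level detail.

```python
import torch
import torch.nn as nn

class BidirectionalCrossModalAttention(nn.Module):
    """Sketch of the two-step interaction: pixels first gather an adaptive
    language context, which then guides the update of the visual features."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor):
        # Step 1: visual guidance -> pixel-wise adaptive linguistic context.
        attn = torch.softmax(self.q(vis) @ self.k(lang).transpose(1, 2)
                             / vis.shape[-1] ** 0.5, dim=-1)     # (B, HW, T)
        lang_ctx = attn @ lang                                   # (B, HW, C)
        # Step 2: the learned language context guides the visual update.
        vis = vis + torch.sigmoid(self.gate(lang_ctx)) * vis
        return vis, lang_ctx

def gated_bottom_up_fusion(low: torch.Tensor, high: torch.Tensor,
                           pred_logits: torch.Tensor) -> torch.Tensor:
    """Bottom-up fusion gated by the segmentation prediction: low-level
    detail passes only where the current prediction is confident."""
    gate = torch.sigmoid(pred_logits)      # (B, 1, H, W) prediction map as gate
    return high + gate * low               # (B, C, H, W) fused features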
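For the third contribution, the abstract only states that two networks mutually select high-confidence pseudo labels for each other. The sketch below fills that in with a co-teaching-style small-loss criterion, which is an assumption: each network keeps the pixels whose per-pixel loss against the pseudo label is smallest and hands that subset to its peer for training.

```python
import torch
import torch.nn.functional as F

def mutual_label_selection(logits_a: torch.Tensor, logits_b: torch.Tensor,
                           pseudo_labels: torch.Tensor, keep_ratio: float = 0.7):
    """Each network selects the pixels on which it agrees most confidently
    with the pseudo label and passes them to its peer (co-teaching-style
    small-loss selection; the exact criterion in the thesis may differ)."""
    def small_loss_mask(logits: torch.Tensor) -> torch.Tensor:
        loss = F.binary_cross_entropy_with_logits(
            logits, pseudo_labels, reduction="none")    # per-pixel loss
        k = max(1, int(keep_ratio * loss.numel()))
        thresh = loss.flatten().kthvalue(k).values       # keep the k smallest losses
        return (loss <= thresh).float()
    # Network A selects clean-looking pixels for network B, and vice versa.
    mask_for_b = small_loss_mask(logits_a)
    mask_for_a = small_loss_mask(logits_b)
    loss_a = (F.binary_cross_entropy_with_logits(
        logits_a, pseudo_labels, reduction="none") * mask_for_a).mean()
    loss_b = (F.binary_cross_entropy_with_logits(
        logits_b, pseudo_labels, reduction="none") * mask_for_b).mean()
    return loss_a, loss_b
```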
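Finally, the language-guided multi-scale dynamic filtering of the fourth contribution can be illustrated as follows. The sketch predicts a position-adaptive filter of each kernel size from a language-guided spatio-temporal context tensor and applies it to the current frame via unfolding; the kernel sizes, the softmax normalization of the filters, and the averaging across scales are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedDynamicFilter(nn.Module):
    """Sketch: per-position filters of several kernel sizes are predicted
    from a language-guided spatio-temporal context and applied to the
    current frame's features."""

    def __init__(self, dim: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.kernel_sizes = kernel_sizes
        # One head per scale predicts a k*k filter for every spatial position.
        self.heads = nn.ModuleList(
            nn.Conv2d(dim, k * k, kernel_size=1) for k in kernel_sizes)

    def forward(self, frame_feat: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # frame_feat: (B, C, H, W) current-frame features
        # context:    (B, C, H, W) language-guided spatio-temporal context
        B, C, H, W = frame_feat.shape
        out = 0
        for k, head in zip(self.kernel_sizes, self.heads):
            filt = head(context).view(B, 1, k * k, H * W)        # position-adaptive filters
            filt = torch.softmax(filt, dim=2)
            patches = F.unfold(frame_feat, k, padding=k // 2)    # (B, C*k*k, H*W)
            patches = patches.view(B, C, k * k, H * W)
            out = out + (patches * filt).sum(dim=2).view(B, C, H, W)
        return out / len(self.kernel_sizes)                      # updated current-frame features
```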